Method for diagnosing lung cancers using gene expression profiles in peripheral blood mononuclear cells

ABSTRACT

Methods and compositions are provided for diagnosing lung cancer in a mammalian subject by use of three or more selected genes, e.g., a gene expression profile, from the peripheral blood mononuclear cells (PBMC) of the subject, which profile is characteristic of disease or a stage of the disease, or enables prognosis of recurrence of disease. The gene expression profile includes three or more genes of Table I, Table II, Table III, Table IV, Table V, Table VI or Table VII herein. Changes in expression of the selected genes forming the gene expression profile, relative to a reference gene expression profile, are correlated with non-small cell lung cancer (NSCLC). One composition for use in such diagnosis includes three or more PCR primer-probe sets, wherein each primer-probe set amplifies a different polynucleotide sequence from the gene expression profile. Another composition for similar use contains a plurality of polynucleotide probes immobilized on a substrate, which probes hybridize to three or more gene expression products from genes in the gene expression profile. Still another composition involves detection of the protein expression products of genes from the gene expression profile.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of pending U.S. patent application Ser. No. 13/914,902, filed Jun. 11, 2013, which is a continuation of U.S. patent application Ser. No. 12/745,991, filed Jun. 3, 2010, now U.S. Pat. No. 8,476,420, which is a 371 of International Patent Application No. PCT/US2008/013450, filed Dec. 5, 2008, now expired, which claimed the benefit of the priority of U.S. Provisional Patent Application No. 61/005,569, filed Dec. 5, 2007, now expired. All priority applications are incorporated herein by reference in their entireties.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under Grant No. R01CA125749 awarded by the National Institutes of Health. The government has certain rights in this invention.

BACKGROUND OF THE INVENTION

Lung cancer is the most common worldwide cause of cancer mortality. In the United States, lung cancer is the second most prevalent cancer in both men and women and will account for more than 174,000 new cases per year and more than 162,000 cancer deaths. In fact, lung cancer accounts for more deaths each year than breast, prostate and colorectal cancers combined².

The high mortality (80-85% in five years), which has shown little or no improvement in the past 30 years, emphasizes the need for new and effective tools to facilitate early diagnosis prior to metastasis to regional nodes or beyond the lung⁶.

High risk populations include smokers, former smokers, and individuals with markers associated with genetic predispositions⁹¹⁻⁹³. Because surgical removal of early stage tumors remains the most effective treatment for lung cancer, there has been great interest in screening high-risk patients with low dose spiral CT (LDCT)^(12,14,15,94). This strategy identifies non-calcified pulmonary nodules in approximately 30-70% of high risk individuals, but only a small proportion of detected nodules are ultimately diagnosed as lung cancers (0.4 to 2.7%)^(16,95,96). Currently, the only way to differentiate subjects with lung nodules of benign etiology from subjects with malignant nodules is an invasive biopsy, surgery, or prolonged observation with repeated scanning. Even using the best clinical algorithms, 20-55% of patients selected to undergo surgical lung biopsy for indeterminate lung nodules are found to have benign disease, and those that do not undergo immediate biopsy or resection require sequential imaging studies. The use of serial CT in this group of patients runs the risk of delaying potentially curable therapy, along with the costs of repeat scans, the not-insignificant radiation doses, and the anxiety of the patient.

Ideally, a diagnostic test would be easily accessible and inexpensive, demonstrate high sensitivity and specificity, and result in improved patient outcomes (medically and financially). Efforts are in progress to develop non-invasive diagnostics using sputum, blood or serum and analyzing for products of tumor cells, methylated tumor DNA^(7,8), single nucleotide polymorphisms (SNPs)⁹, expressed messenger RNA¹⁰ or proteins¹¹. This broad array of molecular tests with potential utility for early diagnosis of lung cancer has been discussed in the literature. Although each of these approaches has its own merits, none has yet passed the exploratory stage in the effort to detect patients with early stage lung cancer, even in high-risk groups, or patients who have a preliminary diagnosis based on radiological and other clinical factors¹². A simple blood test, a routine event associated with regular clinical office visits, would be an ideal diagnostic test.

One established method to achieve the goal of genetic diagnosis has been the use of microarray signatures from tumor tissue²⁰. This approach has been tested and validated by numerous investigators⁸⁹. An increasing number of studies have shown that peripheral blood mononuclear cell (PBMC) profiles can be used to diagnose and classify systemic diseases, including cancer, and to monitor therapeutic response.²¹ The validity of using PBMC profiles in patients with cancer has been previously reported in the use of microarrays to compare PBMC from patients with late stage renal cell carcinoma to PBMC from normal controls^(20,42). A more recent publication⁴³ describes the development of a 37 gene classifier for detecting early breast cancer from peripheral blood samples with 82% accuracy. Another study identified gene expression profiles in the PBMC of colorectal cancer patients that could be correlated with response to therapy⁴⁴. Some of the present inventors previously suggested²² that chemokines and cytokines released by malignant cells could impose a tumor specific signature on immune cells of patients with non-hematopoietic cancers. Gene expression profiles have now been generated from PBMC that identify blood signatures associated with a variety of cancers, including metastatic melanoma²³, breast²⁴, renal^(25,26) and bladder cancer²⁷. Most of these studies focused on late stage cancers or response to therapy and used younger healthy control groups for comparison.

While the effect of chronic obstructive pulmonary disease (COPD) on PBMC gene expression is relatively unstudied to date, there are some limited reports about the effect of cigarette smoke³³. Exposure of peripheral blood lymphocytes (PBL) ex vivo to cigarette smoke induced many changes in gene expression³⁴. Changes could be detected in the transcriptome of blood neutrophils in COPD patients versus normals³⁵. One study distinguished “between 85 individuals exposed and unexposed to tobacco smoke on the basis of mRNA expression in peripheral leukocytes”³⁶. No data are apparently available regarding similar changes in blood that may be present in former smokers. Gene expression in airway epithelia of smokers, ex-smokers and non-smokers has been compared³⁷. Although many clinical manifestations of smoking rapidly returned to normal after smoking cessation, there was a subset of genes whose expression remained altered. Differential gene hypermethylation³⁸ and dysregulated macrophage cytokine production³³ have also been linked to cigarette smoke. However, to date, there are no reports of a gene expression profile or signature useful in the diagnosis of lung cancer.

Despite recent advances, the challenge of cancer treatment remains to target specific treatment regimens to pathogenically distinct tumor types, and ultimately personalize tumor treatment in order to maximize outcome. Hence, a need exists for tests that simultaneously provide predictive information about patient responses to the variety of treatment options. In particular, once a patient is diagnosed with cancer, there is a strong need for methods that allow the physician to predict the expected course of disease, including the likelihood of cancer recurrence, long-term survival of the patient, and the like, and select the most appropriate treatment option accordingly. There also remains a need in the art for a less invasive diagnostic test that could more accurately determine the risk of malignant disease in patients with lung nodules and would reduce unnecessary surgery, biopsies, PET scans, and/or repeated CT scans.

SUMMARY OF THE INVENTION

In one aspect, a composition for diagnosing or evaluating a lung cancer in a mammalian subject includes (a) three or more polynucleotides or oligonucleotides, wherein each polynucleotide or oligonucleotide hybridizes to a different gene, gene fragment, gene transcript or expression product from mammalian peripheral blood mononuclear cells (PBMC), or (b) three or more ligands, wherein each ligand binds to a different gene expression product from mammalian peripheral blood mononuclear cells (PBMC). Each gene, gene fragment, gene transcript or expression product is selected from (i) the genes of Table I; (ii) the genes of Table II; (iii) the genes of Table III; (iv) the genes of Table IV, or (v) a combination of genes from more than one of these Tables.

Thus, in one embodiment, a composition for diagnosing or evaluating lung cancer in a mammalian subject includes three or more PCR primer-probe sets, wherein each primer-probe set amplifies a different polynucleotide or oligonucleotide sequence from a gene expression product of three or more informative genes selected from a gene expression profile in the peripheral blood mononuclear cells (PBMC) of the subject. The gene expression profile includes three or more genes of Table I or Table II or Table III or Table IV or a combination thereof. This composition enables amplification of genes in the gene expression profile and detection of changes in expression in the genes in the subject's gene expression profile from that of a reference gene expression profile. The various reference gene expression profiles are described below. Such changes correlate with a lung cancer, such as a non-small cell lung cancer (NSCLC).

Thus, in another aspect, a composition for diagnosing or evaluating a lung cancer in a mammalian subject is composed of a plurality of polynucleotides or oligonucleotides immobilized on a substrate. The plurality of genomic probes hybridizes to three or more gene expression products of three or more informative genes selected from a gene expression profile in the PBMC of the subject. The gene expression profile includes three or more genes of Table I or Table II or Table III or Table IV or a combination thereof. This composition enables detection of changes in expression in said genes in said gene expression profile from that of a reference gene expression profile, said changes correlated with a diagnosis, prognosis or evaluation of a lung cancer, e.g., NSCLC.

Thus, in another embodiment, a composition or kit for diagnosing or evaluating a lung cancer in a mammalian subject includes a plurality of ligands that bind to three or more gene expression products of three or more informative genes selected from a gene expression profile in the PBMC of the subject. The gene expression profile includes three or more genes of Table I or Table II or Table III or Table IV or a combination thereof. This composition enables detection of changes in expression in said genes in said gene expression profile from that of a reference gene expression profile, said changes correlated with a lung cancer, such as NSCLC.

Thus, in still another embodiment, a composition for diagnosing or evaluating a lung cancer in a mammalian subject includes a plurality of gene expression products of three or more informative genes selected from a gene expression profile in the PBMC of the subject immobilized on a substrate for detection or quantification of antibodies in the PBMC of the subject. The gene expression profile comprises three or more genes of Table I or Table II or Table III or Table VII or a combination thereof. This composition enables detection of changes in expression in the genes in the gene expression profile from that of a reference gene expression profile, said changes correlated with a diagnosis or evaluation of a lung cancer, such as NSCLC.

In another aspect, any of the compositions described above employ polynucleotides, oligonucleotides, or ligands that hybridize, amplify or bind to the genes or products of the informative genes from Table I that include three or more genes selected from the group consisting of IGSF6, HSPA8(A), LYN, DNCL1, HSPA1A, DPYSL2, HAGK, HSPA8(I), NFKBIA, FGL2, CALM2, CCL5, RPS2, DDIT4 and C1orf63.

In still a further aspect, any of the compositions described above employ polynucleotides, oligonucleotides, or ligands that hybridize, amplify or bind to the genes or products of the informative genes from Table II that include three or more genes selected from the group consisting of ETS1, CCL5, DDIT4, CXCR4, DNCL1, MS4ABA, ATP5B, HSPA8(A), ADM, PTPN6, ARHGAP9, S100A8, DPYSL2, HSPA1A, and NFKBIA.

In another aspect, any of the compositions described above employ polynucleotides, oligonucleotides, or ligands that hybridize, amplify or bind to the genes or products of the informative genes from Table III that include three or more genes selected from the group consisting of TSC22D3, CXCR4, DNCL1, RPS3, DDIT4, GZMB, BTG1, HSPA8(I), RPL12, SLA, RUNX3, MGC17330, HSPA1A, IL18RAP and CIRBP.

In another aspect, a method for diagnosing or evaluating a lung cancer in a mammalian subject involves identifying changes in the expression of three or more genes from the peripheral blood mononuclear cells (PBMC) of a subject, said genes selected from (a) the genes of Table I; (b) the genes of Table II; (c) the genes of Table III; (d) the genes of Table IV; or (e) a combination thereof, and comparing that subject's gene expression levels with the levels of the same genes in a reference or control, wherein changes in expression of said genes correlate with a diagnosis or evaluation of a lung cancer. In one embodiment, the lung cancer is a NSCLC.

In another aspect, a method for diagnosing or evaluating a lung cancer in a mammalian subject involves identifying a gene expression profile in the PBMC of a subject, the gene expression profile comprising three or more gene expression products of three or more informative genes having increased or decreased expression in lung cancer. The three or more informative genes are selected from the genes of Table I or Table II or Table III or Table IV or a combination thereof. The subject's gene expression profile is compared with a reference gene expression profile from a variety of sources described below. Changes in expression of the informative genes correlate with a diagnosis or evaluation of a lung cancer, e.g., NSCLC.

In still a further aspect, a method of predicting the likelihood of recurrence or evaluating the progression, regression or other response of a lung cancer to therapy in a mammalian subject is provided. This method includes identifying a gene expression profile in the PBMC of a subject after solid tumor resection or chemotherapy. The gene expression profile comprises three or more gene expression products of three or more informative genes from the above noted tables, particularly Table III. The subject's post-surgical or post-therapeutic gene expression profile is then compared with said subject's pre-surgical or pre-therapeutic gene expression profile. Changes in expression of the informative genes correlate with a decreased likelihood of recurrence, a recurrence of cancer, a regression of cancer or some other therapy-related response. In another aspect of this method, a gene expression profile indicative of low recurrence post-surgery or post-therapy is identifiable in the PBMC of a subject that has a background of smoking and/or has COPD.

In another aspect, a novel method for selecting significant genes in comparative gene expression studies is provided. This novel method, i.e., SVM-RCE, combines K-means and Support Vector Machines (SVMs) to identify and score (rank) gene clusters for the purpose of classification by (i) initially using K-means to group genes into clusters; and (ii) using recursive cluster elimination (RCE) to iteratively remove those clusters of genes that contribute the least to the classification performance.
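The two steps named above (K-means grouping of genes, then recursive elimination of the least useful cluster) can be illustrated with the following minimal sketch. It is not the inventors' implementation. To stay dependency-free it swaps in a leave-one-out nearest-centroid classifier where SVM-RCE would train an SVM, and it uses a deterministic farthest-point K-means initialization. All function names, the toy expression matrix, and the parameters `n_clusters` and `keep` are hypothetical, and the sketch assumes at least two samples per class.

```python
from statistics import mean

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10):
    """Plain K-means; centers start at vectors[0] plus farthest points (assumption)."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors, key=lambda v: min(sqdist(v, c) for c in centers)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

def loo_accuracy(samples, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for the SVM)."""
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        centroids = {}
        for cls in set(labels):
            rest = [s for j, (s, l) in enumerate(zip(samples, labels)) if j != i and l == cls]
            centroids[cls] = [mean(col) for col in zip(*rest)]
        pred = min(centroids, key=lambda cls: sqdist(x, centroids[cls]))
        correct += pred == y
    return correct / len(samples)

def rce(expr, labels, n_clusters, keep):
    """Recursive cluster elimination. expr[g][s] is the expression of gene g in sample s."""
    genes = list(range(len(expr)))
    while len(genes) > keep and n_clusters > 1:
        assign = kmeans([expr[g] for g in genes], n_clusters)
        scores = {}
        for c in set(assign):
            members = [g for g, a in zip(genes, assign) if a == c]
            samples = [[expr[g][s] for g in members] for s in range(len(labels))]
            scores[c] = loo_accuracy(samples, labels)
        if len(scores) < 2:
            break  # nothing left to compare against
        worst = min(scores, key=scores.get)  # cluster contributing least to classification
        genes = [g for g, a in zip(genes, assign) if a != worst]
        n_clusters -= 1
    return genes

# Toy data (hypothetical): genes 0-2 differ between classes, genes 3-6 do not.
expr = [
    [9, 9, 9, 1, 1, 1],
    [8, 9, 8, 1, 2, 1],
    [9, 8, 9, 2, 1, 1],
    [5, 4, 5, 4, 5, 4],
    [4, 5, 4, 5, 4, 5],
    [5, 5, 4, 4, 5, 5],
    [4, 4, 5, 5, 4, 4],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = case, 0 = control
selected = rce(expr, labels, n_clusters=2, keep=3)
print(selected)
```

In this toy run the cluster of class-independent genes scores worse than the cluster of class-dependent genes and is removed in the first round. A fuller version would re-score with cross-validated SVMs and shrink the cluster count over many rounds.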

In yet a further aspect, a composition for diagnosing or evaluating a lung cancer in a mammalian subject is provided. This composition includes (a) three or more polynucleotides or oligonucleotides, wherein each polynucleotide or oligonucleotide hybridizes to a different gene, gene fragment, gene transcript or expression product from mammalian peripheral blood mononuclear cells (PBMC), or (b) three or more ligands, wherein each ligand binds to a different gene expression product from mammalian peripheral blood mononuclear cells (PBMC). The gene, gene fragment, gene transcript or expression product is selected from among (i) the genes of Table V; (ii) the genes of Table VI; (iii) the genes of Table VII, or (iv) genes from a combination of these Tables.

In one embodiment, the composition includes polynucleotides or oligonucleotides that hybridize to, or ligands that bind the expression products of, the first 29 genes of Table V (hereinafter referred to as “the 29 gene classifier”) or a subset thereof. This embodiment is particularly useful for diagnosis of a lung cancer, such as a NSCLC, and distinguishing between subjects with cancer and subjects with non-cancer lung disease.

In another embodiment, the composition includes polynucleotides or oligonucleotides that hybridize to, or ligands that bind the expression products of, the first four genes of Table VI, or a subset thereof. This embodiment is particularly useful for determining the prognosis of post-surgical lung cancer subjects.

In another embodiment, the composition includes polynucleotides or oligonucleotides that hybridize to, or ligands that bind the expression products of, the 24 genes of Table VII, or a subset thereof. This embodiment is particularly useful for diagnosis of a lung cancer and distinguishing between subjects with cancer and subjects with benign lung nodules.

In still another aspect, a method for diagnosing or evaluating lung cancer in a mammalian subject comprises identifying changes in the expression of three or more genes from the peripheral blood mononuclear cells (PBMC) of said subject. The genes are selected from (a) the genes of Table V; (b) the genes of Table VI; (c) the genes of Table VII, and (d) the genes from a combination of these tables. The subject's gene expression levels of the selected genes or gene signature are compared with the levels of the same genes or profile in a reference or control. Changes in expression of these genes between the subject and the control correlate with a diagnosis or prognosis of a lung cancer, or an evaluation of recurrence or other response to therapy.

Other aspects and advantages of these compositions and methods are described further in the following detailed description of the preferred embodiments thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a bar graph showing the SVM classification scores for 44 early stage adenocarcinoma (AC T1T2) patient samples (dark bars) and 52 non-healthy controls (NHC, indicated by lighter bars) using 15 genes selected by SVM-RFE. See the 15 genes of Table IV, column labeled “AC/NHC”. SVM-scores are calculated as an average across all SVM-scores assigned to a sample when it is in a test set during cross-validation. Each column represents one sample. Error bars represent the standard deviation of the classifications over the 100 resamplings. The ROC curve for the 15 gene classifier performance produced an area under the curve (AUC) of 0.92 (curve not shown).
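As a minimal sketch of the per-sample summary described for FIG. 1, the following averages the scores a sample received whenever it fell in a test set and reports their spread as the error bar. The function name, the input layout, and the use of the sample (n-1) standard deviation are assumptions for illustration; the source does not state which standard deviation estimator was used. The sign convention (positive score read as a case) follows the description of the discriminant scores in the other figures.

```python
from statistics import mean, stdev

def summarize_scores(test_scores):
    """Map each sample id to (mean score, standard deviation, predicted class).

    test_scores: dict of sample id -> list of SVM scores, one per resampling
    in which that sample was held out in the test set.
    """
    summary = {}
    for sample_id, scores in test_scores.items():
        avg = mean(scores)
        spread = stdev(scores) if len(scores) > 1 else 0.0  # error bar height
        predicted = "case" if avg > 0 else "control"         # sign of the averaged score
        summary[sample_id] = (avg, spread, predicted)
    return summary

# Hypothetical scores for two samples across three resamplings.
example = summarize_scores({
    "sample_A": [1.0, 2.0, 3.0],
    "sample_B": [-0.5, -1.5, -1.0],
})
print(example)
```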

FIG. 2 is a bar graph showing the SVM classification of combined AC+LSCC (NSCLC; dark bars) and NHC (lighter bars) using the 15 genes selected by SVM-RFE (Table IV, column labeled ALL/NHC). The discriminant scores for the 77 NSCLC samples and 52 NHC samples are shown. Lighter bars with positive scores are misclassified NHC and darker bars with negative scores are misclassified case samples. The ROC curve for the 15 gene classifier produced an AUC of 0.897 (curve not shown).

FIG. 3 is a bar graph showing a pairwise comparison of discriminant scores for pre-surgery samples (dark bars) and post-surgery samples (light bars). The 15 genes selected by SVM-RFE (see Table IV, column labeled “PRE/POST”) were used to assign discriminant scores to the post-surgery samples. These scores are shown with the score for the same patient arranged in pre-post pairs. A negative score indicates this sample is more similar to the NHC samples used to select the 15 gene classifier.

FIG. 4 is a bar graph showing the SVM-RFE analysis of pre-surgery to post-surgery samples. The 16 pre-surgery samples (dark bars) were indicated as the positive class and the 16 post-surgery samples (light bars) as the negative class. SVM-RFE was carried out starting with the top 1,000 genes identified by t-test and then reduced to 1. The classifier built on six genes (the top 6 genes of Table IV, column labeled PRE/POST, namely TSC22D3, CXCR4, DNCL1, RPS3, DDIT4, GZMB) gave an overall accuracy of 93% and these were used to generate the SVM scores. The ROC curve for the 6 gene classifier produced an AUC of 0.96 (curve not shown). A discriminant score was given to each sample (positive is indicative of lung cancer; negative is indicative of no cancer). In all but two samples, the post-surgery score is lower than the pre-surgery score. These data support the detection of a tumor-related gene expression signature that diminishes after surgery. The extent of those changes reflects the possibility of recurrence.
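The paired pre/post comparison described for FIG. 4 can be sketched as below. This is only an illustration of the bookkeeping: the function name, patient identifiers, and score values are hypothetical and are not data from the figures.

```python
def compare_pairs(pairs):
    """Split patients by whether the post-surgery score fell below the pre-surgery score.

    pairs: dict of patient id -> (pre_score, post_score).
    Returns (ids whose score decreased, ids whose score did not decrease).
    """
    decreased = [pid for pid, (pre, post) in pairs.items() if post < pre]
    not_decreased = [pid for pid, (pre, post) in pairs.items() if post >= pre]
    return decreased, not_decreased

# Hypothetical discriminant scores (positive = cancer-like, negative = control-like).
pairs = {
    "P1": (1.2, -0.8),
    "P2": (0.9, 0.1),
    "P3": (0.4, 0.6),
}
decreased, not_decreased = compare_pairs(pairs)
print(decreased, not_decreased)
```

A patient in the second list would be one whose signature did not diminish after surgery, which the text associates with the possibility of recurrence.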

FIG. 5 is a graph showing application of the 29 gene NSCLC classifier to PBMC samples taken pre- and post-surgical resection in 18 patients from the University of Pennsylvania.

FIG. 6 is a graph showing classification of pre- and post-surgery samples with a 4 gene classifier (CYP2R1, MYO5B, DGUOK and DNCL1) trained by SVM-RFE with 10-fold cross-validation.

DETAILED DESCRIPTION OF THE INVENTION

The methods and compositions described herein apply gene expression technology to blood screening for the detection, diagnosis, and monitoring of response to treatment of lung cancer. The compositions and methods described herein permit the diagnosis of a disease or its stage generally, and lung cancers particularly, by determining a characteristic RNA expression profile of the genes of the peripheral blood mononuclear cells (PBMC) or peripheral blood lymphocytes (PBL) of a mammalian, preferably human, subject. The profile is established by comparing the profiles of numerous subjects of the same class (e.g., patients with a certain type and stage of lung cancer, or a mixture of types and stages) with numerous subjects of a class from which these individuals must be distinguished in order to provide a useful diagnosis.

These methods of lung cancer screening employ compositions suitable for conducting a simple, cost-effective and non-invasive blood test using gene expression profiling that could alert the patient and physician to obtain further studies, such as a chest radiograph or CT scan, in much the same way that the prostate specific antigen is used to help diagnose and follow the progress of prostate cancer. The gene expression profiles described herein provide the basis for a variety of classifications related to this diagnostic problem. The application of these profiles provides overlapping and confirmatory diagnoses of the type of lung disease, beginning with the initial test for malignant vs. non-malignant disease.

I. DEFINITIONS

“Patient” or “subject” as used herein means a mammalian animal, including a human, a veterinary or farm animal, a domestic animal or pet, and animals normally used for clinical research. In one embodiment, the subject of these methods and compositions is a human.

“Control” or “Control subject” as used herein refers to the source of the reference gene expression profiles as well as the particular panel of control subjects identified in the examples below. For example, the control subject in one embodiment can be a control with lung cancer, such as a subject who is a current or former smoker with malignant disease; a subject with a solid lung tumor prior to surgery for removal of same; a subject with a solid lung tumor following surgical removal of said tumor; a subject with a solid lung tumor prior to therapy for same; and a subject with a solid lung tumor during or following therapy for same. In other embodiments, the controls for purposes of the compositions and methods described herein include any of the following classes of reference human subject with no lung cancer. Such non-healthy controls (NHC) include the classes of a smoker with non-malignant disease, a former smoker with non-malignant disease (including patients with lung nodules), a non-smoker who has chronic obstructive pulmonary disease (COPD), and a former smoker with COPD. In still other embodiments, the control subject is a healthy non-smoker with no disease or a healthy smoker with no disease. In yet other embodiments, the control or reference is the same subject in which the genes or gene profile was assessed prior to surgery, or at another earlier timepoint, to enable assessment of surgical or treatment efficacy or prognosis or progression of disease. Selection of the particular class of controls depends upon the use to which the diagnostic/monitoring methods and compositions are to be put by the physician.

In the examples below, the selected control group, non-healthy controls, is specifically chosen to match as closely as possible the patients with malignant disease. The match includes both smoking status and smoking-related diseases such as COPD. All subjects of both classes were either current or former smokers when they presented with symptoms of disease. The most informative genes identified below can distinguish smokers with malignant disease from smokers with non-malignant disease. These informative genes do not include those previously found to distinguish smokers from non-smokers, for example CYP1B1, HML2, CCR2, NRG1.³⁶

“Sample” as used herein means any biological fluid or tissue that contains immune cells and/or cancer cells. The most suitable sample for use in this invention includes peripheral blood, more specifically peripheral blood mononuclear cells. Other useful biological samples include, without limitation, whole blood, saliva, urine, synovial fluid, bone marrow, cerebrospinal fluid, vaginal mucus, cervical mucus, nasal secretions, sputum, semen, amniotic fluid, bronchoalveolar lavage fluid, and other cellular exudates from a patient having cancer. Such samples may further be diluted with saline, buffer or a physiologically acceptable diluent. Alternatively, such samples are concentrated by conventional means.

“Immune cells” as used herein means B-lymphocytes, T-lymphocytes, NK cells, macrophages, mast cells, monocytes and dendritic cells.

As used herein, the term “cancer” refers to or describes the physiological condition in mammals that is typically characterized by unregulated cell growth. More specifically, as used herein, the term “cancer” means any lung cancer. In one embodiment, the lung cancer is non-small cell lung cancer (NSCLC). In a more specific embodiment, the lung cancer is lung adenocarcinoma (AC or LAC). In another more specific embodiment, the lung cancer is lung squamous cell carcinoma (SCC or LSCC). In another embodiment, the lung cancer is a stage I or stage II NSCLC. In still another embodiment, the lung cancer is a mixture of early and late stages and types of NSCLC.

The term “tumor,” as used herein, refers to all neoplastic cell growth and proliferation, whether malignant or benign, and all pre-cancerous and cancerous cells and tissues.

“Diagnosis” or “evaluation” as used herein refers to a diagnosis of a lung cancer, a diagnosis of a stage of lung cancer, a diagnosis of a type or classification of a lung cancer, a diagnosis or detection of a recurrence of a lung cancer, a diagnosis or detection of a regression of a lung cancer, a prognosis of a lung cancer, or an evaluation of the response of a lung cancer to a surgical or non-surgical therapy.

By “change in expression” is meant an upregulation of one or more selected genes in comparison to the reference or control; a downregulation of one or more selected genes in comparison to the reference or control; or a combination of certain upregulated genes and downregulated genes.

By “therapeutic reagent” or “regimen” is meant any type of treatment employed in the treatment of cancers with or without solid tumors, including, without limitation, chemotherapeutic pharmaceuticals, biological response modifiers, radiation, diet, vitamin therapy, hormone therapies, gene therapy, surgical resection, etc.

By “non-tumor genes” as used herein is meant genes which are normally expressed in other cells, preferably immune cells, of a healthy mammal, and which are not specifically products of tumor cells.

By “informative genes” as used herein is meant those genes the expression of which changes (either in an up-regulated or down-regulated manner) characteristically in the presence of lung cancer. A statistically significant number of such informative genes thus forms a suitable gene expression profile for use in the methods and compositions.

The term “statistically significant number of genes” in the context of this invention differs depending on the degree of change in gene expression observed. The degree of change in gene expression varies with the type of cancer and with the size or spread of the cancer or solid tumor. The degree of change also varies with the immune response of the individual and is subject to variation with each individual. For example, in one embodiment of this invention, a large change, e.g., 2-3 fold increase or decrease in a small number of genes, e.g., in from 3 to 8 characteristic genes, is statistically significant. This is particularly true for cancers without solid tumors. In another embodiment, a smaller relative change in about 10, 20, 24, 29, or 30 or more genes is statistically significant. This is particularly true for cancers with solid tumors. Still alternatively, if a single gene is profiled as up-regulated or expressed significantly in cells which normally do not express the gene, such up-regulation of a single gene may alone be statistically significant. Conversely, if a single gene is profiled as down-regulated or not expressed significantly in cells which normally do express the gene, such down-regulation of a single gene may alone be statistically significant. As an example, a single gene, which is expressed about the same in all members of a population of patients, is 4-fold down-regulated in only 1% of individuals without cancer. Four such independently regulated genes in one individual, all 4-fold down-regulated, would occur by chance only one time in 100 million. Therefore those 4 genes are a statistically significant number of genes for that cancer. Alternatively, if normal variance is higher, e.g., one healthy person in 10 has the gene 4-fold down-regulated, then a larger panel of genes is required to detect variance for a particular cancer.
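The arithmetic behind the "one time in 100 million" figure can be checked with the short sketch below. It assumes, as the text does, that the genes are independently regulated, so the chance that all of them are down-regulated in one cancer-free individual is the product of the per-gene rates. The function names are hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

def joint_chance(per_gene_rate, n_genes):
    """Chance that n independent genes are all down-regulated in one cancer-free person."""
    return per_gene_rate ** n_genes

def genes_needed(per_gene_rate, target):
    """Smallest panel size whose joint chance is at or below the target rate."""
    if not 0 < per_gene_rate < 1:
        raise ValueError("per_gene_rate must be strictly between 0 and 1")
    n = 1
    while per_gene_rate ** n > target:
        n += 1
    return n

one_percent = Fraction(1, 100)   # gene down-regulated in 1% of individuals without cancer
ten_percent = Fraction(1, 10)    # noisier gene: one healthy person in 10

four_gene_chance = joint_chance(one_percent, 4)
panel_for_noisy_genes = genes_needed(ten_percent, four_gene_chance)
print(four_gene_chance, panel_for_noisy_genes)
```

Under these assumptions, genes that are down-regulated in one healthy person in 10 would need a panel twice as large to reach the same chance level, which is the sense in which higher normal variance calls for a larger panel.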

Thus, the methods and compositions described herein contemplate examination of the expression profile of a “statistically significant number of genes” ranging from 1 to about 100 genes in a single profile. In one embodiment, the gene profile is formed by a statistically significant number of 1 or more genes. In another embodiment, the gene profile is formed by a statistically significant number of 3 or more genes. In still another embodiment, the gene profile is formed by 4 or more genes. In still another embodiment, the gene profile is formed by at least 5 to 15 or more genes. In still another embodiment, the gene profile is formed by 24 or 29 or more genes. In still other embodiments, the gene profiles examined as part of these methods, particularly in cases in which the cancers are characterized by solid tumors, contain, as statistically significant numbers of genes, from 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, or 90 or more genes in a panel, and any numbers therebetween.

Tables I to VII below refer to collections of known genes. Tables I, II and III include the top 100 genes in each classification identified by the inventors as capable of forming a gene expression profile for three distinct classifications of disease. Table I identifies the top 100 genes that can be used in a gene expression profile to identify the presence of a lung cancer, e.g., any NSCLC. Table II identifies the top 100 genes that can be used in a gene expression profile to distinguish the occurrence of a lung cancer, and in one embodiment are useful to distinguish AC from any other NSCLC. Table III identifies the top 100 genes that can be used in a gene expression profile to identify the changes consistent with post-surgical improvement of and/or the maintenance of post-surgical improvement of a lung cancer, such as an NSCLC. This latter collection of genes is also anticipated to be useful in tracking improvement during or following therapeutic treatment of a lung cancer, such as an NSCLC. Table IV shows the top 15 gene classifiers for a gene expression profile to identify the presence of a lung cancer, such as NSCLC (i.e., taken from Table I), to identify an AC (i.e., taken from Table II), and to identify the post-surgical status of a subject (i.e., taken from Table III).

Table V identifies an additional 136 genes useful in forming gene profiles for use in diagnosing patients with a lung cancer, such as an NSCLC, from a control, particularly non-healthy controls. The top ranked 29 genes in this table are referenced as “the 29 gene classifier” in Examples 14-18 below. Table VI identifies another set of 50 genes useful in a gene expression profile to identify the changes consistent with post-surgical improvement of and/or the maintenance of post-surgical improvement of a lung cancer. Similarly these genes are useful as a gene signature to monitor cancer progression or regression in a patient treated non-surgically for a lung cancer. Table VII identifies a set of 24 genes useful in discriminating between a subject having a lung cancer, e.g., NSCLC, and subjects having benign (non-malignant) lung nodules.

The genes identified in Tables I through VII are publicly available. One skilled in the art may readily reproduce the compositions and methods described herein by use of the sequences of the genes, all of which are publicly available from conventional sources, such as GenBank.

The term “microarray” refers to an ordered arrangement of hybridizable array elements, preferably polynucleotide or oligonucleotide probes, on a substrate.

The term “polynucleotide,” when used in singular or plural form, generally refers to any polyribonucleotide or polydeoxyribonucleotide, which may be unmodified RNA or DNA or modified RNA or DNA. Thus, for instance, polynucleotides as defined herein include, without limitation, single- and double-stranded DNA, DNA including single- and double-stranded regions, single- and double-stranded RNA, and RNA including single- and double-stranded regions, hybrid molecules comprising DNA and RNA that may be single-stranded or, more typically, double-stranded or include single- and double-stranded regions. In addition, the term “polynucleotide” as used herein refers to triple-stranded regions comprising RNA or DNA or both RNA and DNA. The strands in such regions may be from the same molecule or from different molecules. The regions may include all of one or more of the molecules, but more typically involve only a region of some of the molecules. One of the molecules of a triple-helical region often is an oligonucleotide. The term “polynucleotide” specifically includes cDNAs. The term includes DNAs (including cDNAs) and RNAs that contain one or more modified bases. Thus, DNAs or RNAs with backbones modified for stability or for other reasons are “polynucleotides” as that term is intended herein. Moreover, DNAs or RNAs comprising unusual bases, such as inosine, or modified bases, such as tritiated bases, are included within the term “polynucleotides” as defined herein. In general, the term “polynucleotide” embraces all chemically, enzymatically and/or metabolically modified forms of unmodified polynucleotides, as well as the chemical forms of DNA and RNA characteristic of viruses and cells, including simple and complex cells.

The term “oligonucleotide” refers to a relatively short polynucleotide, including, without limitation, single-stranded deoxyribonucleotides, single- or double-stranded ribonucleotides, RNA:DNA hybrids and double-stranded DNAs. Oligonucleotides, such as single-stranded DNA probe oligonucleotides, are often synthesized by chemical methods, for example using automated oligonucleotide synthesizers that are commercially available. However, oligonucleotides can be made by a variety of other methods, including in vitro recombinant DNA-mediated techniques and by expression of DNAs in cells and organisms.

The terms “differentially expressed gene,” “differential gene expression” and their synonyms, which are used interchangeably, refer to a gene whose expression is activated to a higher or lower level in a subject suffering from a disease, specifically cancer, such as lung cancer, relative to its expression in a control subject. The terms also include genes whose expression is activated to a higher or lower level at different stages of the same disease. It is also understood that a differentially expressed gene may be either activated or inhibited at the nucleic acid level or protein level, or may be subject to alternative splicing to result in a different polypeptide product. Such differences may be evidenced by a change in mRNA levels, surface expression, secretion or other partitioning of a polypeptide, for example. Differential gene expression may include a comparison of expression between two or more genes or their gene products, or a comparison of the ratios of the expression between two or more genes or their gene products, or even a comparison of two differently processed products of the same gene, which differ between normal subjects, non-healthy controls and subjects suffering from a disease, specifically cancer, or between various stages of the same disease. Differential expression includes both quantitative, as well as qualitative, differences in the temporal or cellular expression pattern in a gene or its expression products among, for example, normal and diseased cells, or among cells which have undergone different disease events or disease stages. For the purpose of this invention, “differential gene expression” is considered to be present when there is a statistically significant (p<0.05) difference in gene expression between the subject and control samples.
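The p<0.05 criterion above can be illustrated with a small sketch. The specification does not say which statistical test is applied at this point, so the exact two-sided permutation test below is an assumption chosen only because it needs no external library; the function name `permutation_p_value` and the expression values are hypothetical, made-up numbers.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(subject, control):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = list(subject) + list(control)
    n = len(subject)
    observed = abs(mean(subject) - mean(control))
    total = 0
    extreme = 0
    # Enumerate every way of relabeling n of the pooled values as "subject".
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical log2 expression values for one gene (illustration only).
subject_samples = [8.1, 8.4, 8.9, 9.2]
control_samples = [6.0, 6.3, 6.8, 7.1]

p = permutation_p_value(subject_samples, control_samples)
print(p < 0.05)  # True: the gene would count as differentially expressed
```

With four samples per group there are only 70 possible relabelings, so the smallest attainable two-sided p-value is 2/70; larger cohorts are needed for finer resolution.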

The term “over-expression” with regard to an RNA transcript is used to refer to the level of the transcript determined by normalization to the level of reference mRNAs, which might be all measured transcripts in the specimen or a particular reference set of mRNAs.

The phrase “gene amplification” refers to a process by which multiple copies of a gene or gene fragment are formed in a particular cell or cell line. The duplicated region (a stretch of amplified DNA) is often referred to as an “amplicon.” Usually, the amount of the messenger RNA (mRNA) produced, i.e., the level of gene expression, also increases in proportion to the number of copies made of the particular gene expressed.

The term “prognosis” is used herein to refer to the prediction of the likelihood of cancer-attributable death or progression, including recurrence, metastatic spread, and drug resistance, of a neoplastic disease, such as lung cancer. The term “prediction” is used herein to refer to the likelihood that a patient will respond either favorably or unfavorably to a drug or set of drugs, and also the extent of those responses, or that a patient will survive, following surgical removal of the primary tumor and/or chemotherapy for a certain period of time without cancer recurrence. The predictive methods of the present invention can be used clinically to make treatment decisions by choosing the most appropriate treatment modalities for any particular patient. The predictive methods described herein are valuable tools in predicting if a patient is likely to respond favorably to a treatment regimen, such as surgical intervention, chemotherapy with a given drug or drug combination, and/or radiation therapy, or whether long-term survival of the patient, following surgery and/or termination of chemotherapy or other treatment modalities is likely.

The term “long-term” survival is used herein to refer to survival for at least 1 year, more preferably for at least 3 years, most preferably for at least 7 years following surgery or other treatment.

“Stringency” of hybridization reactions is readily determinable by one of ordinary skill in the art, and generally is an empirical calculation dependent upon probe length, washing temperature, and salt concentration. In general, longer probes require higher temperatures for proper annealing, while shorter probes need lower temperatures. Hybridization generally depends on the ability of denatured DNA to reanneal when complementary strands are present in an environment below their melting temperature. The higher the degree of desired homology between the probe and hybridizable sequence, the higher is the relative temperature which can be used. As a result, it follows that higher relative temperatures would tend to make the reaction conditions more stringent, while lower temperatures less so. Various published texts^(69,77) provide additional details and explanation of stringency of hybridization reactions.

“Stringent conditions” or “high stringency conditions”, as defined herein, typically: (1) employ low ionic strength and high temperature for washing, for example 0.015 M sodium chloride/0.0015 M sodium citrate/0.1% sodium dodecyl sulfate at 50° C.; (2) employ during hybridization a denaturing agent, such as formamide, for example, 50% (v/v) formamide with 0.1% bovine serum albumin/0.1% Ficoll/0.1% polyvinylpyrrolidone/50 mM sodium phosphate buffer at pH 6.5 with 750 mM sodium chloride, 75 mM sodium citrate at 42° C.; or (3) employ 50% formamide, 5×SSC (0.75 M NaCl, 0.075 M sodium citrate), 50 mM sodium phosphate (pH 6.8), 0.1% sodium pyrophosphate, 5×Denhardt's solution, sonicated salmon sperm DNA (50 μg/ml), 0.1% SDS, and 10% dextran sulfate at 42° C., with washes at 42° C. in 0.2×SSC (sodium chloride/sodium citrate) and 50% formamide at 55° C., followed by a high-stringency wash consisting of 0.1×SSC containing EDTA at 55° C.

“Moderately stringent conditions” may be identified conventionally⁷⁰, and include the use of washing solution and hybridization conditions (e.g., temperature, ionic strength and % SDS) less stringent than those described above. An example of moderately stringent conditions is overnight incubation at 37° C. in a solution comprising: 20% formamide, 5×SSC (150 mM NaCl, 15 mM trisodium citrate), 50 mM sodium phosphate (pH 7.6), 5×Denhardt's solution, 10% dextran sulfate, and 20 mg/ml denatured sheared salmon sperm DNA, followed by washing the filters in 1×SSC at about 37-50° C. The skilled artisan will recognize how to adjust the temperature, ionic strength, etc. as necessary to accommodate factors such as probe length and the like, by use of manufacturer's instructions (see, e.g., Illumina system instructions).

In the context of the compositions and methods described herein, reference to “three or more,” “at least five,” etc. of the genes listed in any particular gene set (e.g., Table I to VII) means any one or any and all combinations of the genes listed. For example, suitable gene expression profiles include profiles containing any number between at least 3 through 100 genes from those Tables. In one embodiment, gene profiles formed by genes selected from a table are preferably used in rank order, e.g., genes ranked in the top of the list demonstrated more significant discriminatory results in the tests, and thus may be more significant in a profile than lower ranked genes. However, in other embodiments the genes forming a useful gene profile do not have to be in rank order and may be any gene from the respective table.

The terms “splicing” and “RNA splicing” are used interchangeably and refer to RNA processing that removes introns and joins exons to produce mature mRNA with continuous coding sequence that moves into the cytoplasm of a eukaryotic cell.

In theory, the term “exon” refers to any segment of an interrupted gene that is represented in the mature RNA product⁷¹. In theory the term “intron” refers to any segment of DNA that is transcribed but removed from within the transcript by splicing together the exons on either side of it. Operationally, exon sequences occur in the mRNA sequence of a gene. Operationally, intron sequences are the intervening sequences within the genomic DNA of a gene, bracketed by exon sequences and having GT and AG splice consensus sequences at their 5′ and 3′ boundaries.

As used herein, “labels” or “reporter molecules” are chemical or biochemical moieties useful for labeling a nucleic acid (including a single nucleotide), polynucleotide, oligonucleotide, or protein ligand, e.g., amino acid or antibody. “Labels” and “reporter molecules” include fluorescent agents, chemiluminescent agents, chromogenic agents, quenching agents, radionucleotides, enzymes, substrates, cofactors, inhibitors, magnetic particles, and other moieties known in the art. “Labels” or “reporter molecules” are capable of generating a measurable signal and may be covalently or noncovalently joined to an oligonucleotide or nucleotide (e.g., a non-natural nucleotide) or ligand.

Unless defined otherwise in this specification, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and by reference to published texts^(72,73), which provide one skilled in the art with a general guide to many of the terms used in the present application.

II. THE GENE EXPRESSION PROFILES

The inventors identified diagnostic gene expression profiles in the peripheral blood lymphocytes of lung cancer patients. The inventors have discovered that the gene expression profiles of the PBMCs of lung cancer patients differ significantly from those seen in appropriately matched (i.e., by age, sex, smoking history) controls. For example, changes in the gene expression products of the genes of these profiles can be observed and detected by the methods of this invention in the normal circulating PBMC of patients with early stage solid lung tumors.

The gene expression profiles described herein provide new diagnostic markers for the early detection of lung cancer and could prevent patients from undergoing unnecessary procedures (e.g., if a small lung nodule is discovered) or potentially be used to screen high-risk patients. Since the risks are very low, the benefit to risk ratio is very high. The methods and compositions described herein may also be useful in other populations, e.g., to screen certain high-risk lung cancer populations, such as asbestos-exposed smokers. In yet another embodiment, the methods and compositions described herein may be used in conjunction with clinical risk factors to help physicians make more accurate decisions about how to manage patients with lung nodules. Another advantage of this invention is that diagnosis may occur early since diagnosis is not dependent upon detecting circulating tumor cells which are present in only vanishingly small numbers in early stage lung cancers.

Because the effects of smoking and/or chronic obstructive pulmonary diseases on the PBMC profile have the potential to obscure the results of diagnostic methods based on gene profiles, as detailed below, the effects of current smoking, former smoking, and COPD are specifically addressed in the compositions and methods herein by use of appropriate populations of matched controls. In one embodiment, the appropriate control class for the comparative studies is at-risk smokers and ex-smokers with non-malignant lung disease so that the smoking-related histories of both patient subject and control subjects are very similar. The data presented in the examples below clearly indicate that the inventors detect a cancer signature in the presence of a background of smoking and/or COPD.

In one embodiment, a novel gene expression profile or signature can identify and distinguish patients with early stage (T1/T2, primarily Stage I/II) non-small cell cancers of the lung (NSCLC) from the appropriate control group of smokers and ex-smokers at high risk for developing lung cancer matched by age, gender, and race. See for example the genes identified in Table I which may form a suitable gene expression profile and those of Table IV, column “ALL/NHC”. In another embodiment, a novel gene expression profile or signature can identify patients with early stage (T1/T2) AC tumors (primarily Stage I and II), in comparison to the closely related NHC control. See Table II and Table IV, column “AC/NHC”. The validity of these methods and gene expression profiles is supported in experimental data measuring the lung cancer “score” in patients before and after surgery. In another embodiment, the gene collections in Table III and Table IV, column “PRE/POST” provide a discrete number of genes that form a suitable profile. These patient/control populations were distinguished by generating a discriminant score based on differences in gene expression profiles as exemplified below. In one embodiment, a 15 gene classifier, i.e., a set of genes that form a gene expression profile, can distinguish between early stage AC tumor vs. non-healthy control profiles with an accuracy of 85%. That gene expression profile is identified in Table IV, column “AC/NHC” below. Additionally, the inventors have identified a gene expression profile classifier that distinguishes both AC and LSCC patients from NHC with an accuracy of 83%, also requiring 15 genes for the profile. That gene expression profile is identified in Table IV, column “ALL/NHC” below. A similar gene expression profile to distinguish pre-surgery from post-surgery patients is also found in Table IV, column “PRE/POST” below.
The data shown in the examples clearly indicate that there is a shared early stage cancer-specific signature that is separate from the patterns that discriminate cancer types (AC vs. LSCC) and that discriminate cancer stage (early vs. late).

More recent data described in Examples 14-18 below provide a new 29 gene expression signature to diagnose subjects with lung cancer from healthy or non-healthy controls (Table V, genes ranked 1-29), as well as additional genes from that table that can form other signatures. The relatively small panel of 29 genes can distinguish early stage NSCLC (Stage 1A-1B) from a highly similar control group with good accuracy. Additionally, a set of 4 genes from the 50 gene selection of Table VI is useful to distinguish and track post-surgical improvement. Further, a new 24 gene expression profile to discriminate between lung cancer subjects and subjects with benign lung nodules is provided in Table VII. The data shown in these examples demonstrate lung cancer gene signatures useful in both diagnosis and evaluating the progress of treatment.

As described in detail in the examples below, by comparing gene expression in PBMC from a large group of NSCLC patients to a comparable group of patients with non-malignant lung diseases, a tumor-induced signature was detected, in smokers and non-smokers, which can be distinguished from effects of smoking-induced non-malignant lung disease. As demonstrated in the examples below, diagnostic signatures are identified in PBMC that distinguish patients with early stage NSCLC from at-risk controls with non-malignant lung disease balanced for smoking, age and gender as well as incidence of COPD. There were also 14 NSCLC patients in these examples that had no prior history of smoking. Lung cancer in individuals who have never smoked has been shown to have several important differences from tobacco-associated lung tumors and some molecular changes that occur have been suggested to be unique to non-smokers^(28,29). 11 of the 14 never-smokers were correctly classified as cancer by the 29 gene classifier, suggesting that the effect on PBMC gene expression of lung cancers in smokers and non-smokers is similar, at least with respect to the PBMC gene signatures.

Fourteen genes associated with nicotinate and nicotinamide metabolism were statistically significantly lower in NSCLC patients when compared to all the controls or compared only to controls with benign lung nodules, suggesting these pathways may be suppressed in NSCLC patients. Differences detected in PBMC between patients before and after surgical resection were numerous. However, 2 of the 4 most informative genes that distinguish the pre- versus post-surgery samples have mitochondrial functions. Mitochondrial genes in general are higher pre-surgery, suggesting the increased requirements for energy described for tumors are also reflected in the PBMC when the tumor is present. Highly significant pathways that were higher in pre-surgery samples were associated with NK cell function and ceramide signaling [NK: 29 genes (p<2.08×10⁻⁸), ceramide: 17 genes (p<8.83×10⁻⁵)]. The most significantly down-regulated pathways included apoptosis and death receptor genes [apoptosis: 15 genes (p<1.74×10⁻²), death receptor: 13 genes (p<1.37×10⁻³)], patterns also characteristic of tumors^(31,32). The observed reduction of the NSCLC cancer signature and the highly significant common differences shown by patients post-surgery support the conclusion that the signatures described herein are tumor induced.

Specific interactions between the tumor, lymphocytes and tumor-released factors contribute to the changes seen in PBMC gene expression and these effects are enhanced in tumor progression, as evidenced by the increased accuracy of our gene panel in classifying late stage NSCLC.

The validity of these signatures was established on samples collected at different locations by different groups and in a cohort of patients with undiagnosed lung nodules. The gene expression profiles identified below by use of ILLUMINA arrays provide global diagnostic signatures to identify patients with lung cancers of various cell types, and provide cell type specific diagnostic signatures. Further, the profiles take into account race, gender and smoking history. The inventors have also tested samples from a group of patients before and after lung cancer surgery, thus eliminating person-to-person variability in assessing the tumor effect. The lung cancer signature consistently diminishes or disappears after removal of the tumor. This result, as discussed in the examples below, strongly supports the identification of a PBMC signature for early stage lung cancer. This data (see Example 12) shows a consistent decrease in each patient's lung cancer score after surgical removal of the cancer as compared to that score before surgery.

The lung cancer signatures or gene expression profiles identified herein and through use of the gene collections of Tables I-VII may be further optimized to reduce the numbers of gene expression products necessary and increase accuracy of diagnosis.

While not wishing to be bound by theory, the inventors' use of gene expression studies of PBMC in disease is based on the proposition that circulating PBMC (peripheral blood mononuclear cells, primarily monocytes and lymphocytes) are affected by localized processes that involve inflammation and/or tumors. This can occur by at least two mechanisms. First, the cells can directly interact in the tissues of the inflammation or tumor. Clearly, a key function of lymphocytes is to “patrol” the tissues of the body, temporarily arrest in abnormal areas, egress from tissues, interact with lymph nodal tissues, become activated, and then re-enter the circulation (with some reentering the tissues). This close interaction clearly alters their phenotype. A second, and probably equally important, process is the response of the PBMC to circulating factors released by cells in the inflammatory response or tumors. Many such factors have been described, including colony stimulating factors (such as G-CSF, GM-CSF), cytokines (e.g., TNF, IL-2, IL-3, IL-4, and IL-, IL-7, IL-15, etc.), chemokines (MCP-1, SDF-1), growth factors (such as Flt-3 ligand, VEGF), immunosuppressive factors (such as IL-10, COX-1, TGF-β), etc. These factors affect immature cells in the bone marrow which are then released into the circulation, as well as cells already in the circulating compartment. This latter mechanism likely affects both the phenotype of released cells and the type of cells released (e.g., early after infection there is an influx of immature neutrophils in the circulation).

Although inflammatory lesions and tumors have some similarities, there are many differences, a very important one being the well known ability of tumors to suppress immune responses. The cancer signatures established by the gene expression profiles described herein can be differentiated from an inflammatory signature.

III. GENE EXPRESSION PROFILING METHODS

Methods of gene expression profiling that were used in generating the profiles useful in the compositions and methods described herein or in performing the diagnostic steps using the compositions described herein are known and well summarized in U.S. Pat. No. 7,081,340. Such methods of gene expression profiling include methods based on hybridization analysis of polynucleotides, methods based on sequencing of polynucleotides, and proteomics-based methods. The most commonly used methods known in the art for the quantification of mRNA expression in a sample include northern blotting and in situ hybridization⁷⁴; RNAse protection assays⁷⁵; and PCR-based methods, such as RT-PCR⁷⁶. Alternatively, antibodies may be employed that can recognize specific duplexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes. Representative methods for sequencing-based gene expression analysis include Serial Analysis of Gene Expression (SAGE), and gene expression analysis by massively parallel signature sequencing (MPSS).

A. Polymerase Chain Reaction (PCR) Techniques

The most sensitive and most flexible quantitative method is RT-PCR, which can be used to compare mRNA levels in different sample populations, in normal and tumor tissues, with or without drug treatment, to characterize patterns of gene expression, to discriminate between closely related mRNAs, and to analyze RNA structure. The first step is the isolation of mRNA from a target sample (e.g., typically total RNA isolated from human PBMC in this case). mRNA can be extracted, for example, from frozen or archived paraffin-embedded and fixed (e.g., formalin-fixed) tissue samples.

General methods for mRNA extraction are well known in the art, such as those in standard textbooks of molecular biology⁷⁷. Methods for RNA extraction from paraffin-embedded tissues are known^(78,79). In particular, RNA isolation can be performed using a purification kit, buffer set and protease from commercial manufacturers, according to the manufacturer's instructions. Exemplary commercial products include TRI-REAGENT, Qiagen RNeasy mini-columns, MASTERPURE Complete DNA and RNA Purification Kit (EPICENTRE®, Madison, Wis.), Paraffin Block RNA Isolation Kit (Ambion, Inc.) and RNA Stat-60 (Tel-Test). Conventional techniques such as cesium chloride density gradient centrifugation may also be employed.

The first step in gene expression profiling by RT-PCR is the reverse transcription of the RNA template into cDNA, followed by its exponential amplification in a PCR reaction. The two most commonly used reverse transcriptases are avian myeloblastosis virus reverse transcriptase (AMV-RT) and Moloney murine leukemia virus reverse transcriptase (MMLV-RT). The reverse transcription step is typically primed using specific primers, random hexamers, or oligo-dT primers, depending on the circumstances and the goal of expression profiling. See, e.g., manufacturer's instructions accompanying the product GENEAMP RNA PCR kit (Perkin Elmer, Calif., USA). The derived cDNA can then be used as a template in the subsequent PCR reaction.

The PCR step generally uses a thermostable DNA-dependent DNA polymerase, such as the Taq DNA polymerase, which has a 5′-3′ nuclease activity but lacks a 3′-5′ proofreading exonuclease activity. Thus, TAQMAN® PCR typically utilizes the 5′-nuclease activity of Taq or Tth polymerase to hydrolyze a hybridization probe bound to its target amplicon, but any enzyme with equivalent 5′ nuclease activity can be used. Two oligonucleotide primers are used to generate an amplicon typical of a PCR reaction. A third oligonucleotide, or probe, is designed to detect nucleotide sequence located between the two PCR primers. The probe is non-extendible by Taq DNA polymerase enzyme, and is labeled with a reporter fluorescent dye and a quencher fluorescent dye. Any laser-induced emission from the reporter dye is quenched by the quenching dye when the two dyes are located close together as they are on the probe. During the amplification reaction, the Taq DNA polymerase enzyme cleaves the probe in a template-dependent manner. The resultant probe fragments disassociate in solution, and signal from the released reporter dye is free from the quenching effect of the second fluorophore. One molecule of reporter dye is liberated for each new molecule synthesized, and detection of the unquenched reporter dye provides the basis for quantitative interpretation of the data.

TaqMan® RT-PCR can be performed using commercially available equipment. In a preferred embodiment, the 5′ nuclease procedure is run on a real-time quantitative PCR device such as the ABI PRISM 7900® Sequence Detection System®. The system amplifies samples in a 96-well format on a thermocycler. During amplification, laser-induced fluorescent signal is collected in real-time through fiber optic cables for all 96 wells, and detected at the CCD. The system includes software for running the instrument and for analyzing the data. 5′-Nuclease assay data are initially expressed as C_(t), or the threshold cycle. As discussed above, fluorescence values are recorded during every cycle and represent the amount of product amplified to that point in the amplification reaction. The point when the fluorescent signal is first recorded as statistically significant is the threshold cycle (C_(t)).

To minimize errors and the effect of sample-to-sample variation, RT-PCR is usually performed using an internal standard. The ideal internal standard is expressed at a constant level among different tissues, and is unaffected by the experimental treatment. RNAs most frequently used to normalize patterns of gene expression are mRNAs for the housekeeping genes glyceraldehyde-3-phosphate-dehydrogenase (GAPDH) and β-actin.
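Normalization of a target gene's threshold cycle to a housekeeping gene can be sketched with the widely used comparative C_(t) calculation (fold change = 2^(-ΔΔC_(t))). The specification does not state that this particular formula is used, so it is shown only as one common way to apply an internal standard; it assumes near-100% amplification efficiency for both target and reference, and the function names and C_(t) values below are hypothetical.

```python
def delta_ct(ct_target: float, ct_reference: float) -> float:
    """Target Ct normalized to the internal standard (e.g., GAPDH or beta-actin)."""
    return ct_target - ct_reference


def fold_change(ct_target_sample: float, ct_ref_sample: float,
                ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression of the target in sample vs. control (2^-ddCt).

    Assumes the amount of product doubles every cycle for both genes.
    """
    ddct = (delta_ct(ct_target_sample, ct_ref_sample)
            - delta_ct(ct_target_control, ct_ref_control))
    return 2.0 ** (-ddct)


# Hypothetical Ct values: the target crosses threshold two cycles earlier
# in the patient sample, while the reference gene is unchanged.
print(fold_change(24.0, 18.0, 26.0, 18.0))  # 4.0
```

A lower C_(t) means more starting template, so a ΔΔC_(t) of -2 corresponds to roughly 4-fold higher expression in the sample than in the control.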

Real time PCR is compatible both with quantitative competitive PCR, where an internal competitor for each target sequence is used for normalization, and with quantitative comparative PCR using a normalization gene contained within the sample, or a housekeeping gene for RT-PCR.¹¹⁰

In another PCR method, i.e., the MassARRAY-based gene expression profiling method (Sequenom, Inc., San Diego, Calif.), following the isolation of RNA and reverse transcription, the obtained cDNA is spiked with a synthetic DNA molecule (competitor), which matches the targeted cDNA region in all positions, except a single base, and serves as an internal standard. The cDNA/competitor mixture is PCR amplified and is subjected to a post-PCR shrimp alkaline phosphatase (SAP) enzyme treatment, which results in the dephosphorylation of the remaining nucleotides. After inactivation of the alkaline phosphatase, the PCR products from the competitor and cDNA are subjected to primer extension, which generates distinct mass signals for the competitor- and cDNA-derived PCR products. After purification, these products are dispensed on a chip array, which is pre-loaded with components needed for analysis with matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) analysis. The cDNA present in the reaction is then quantified by analyzing the ratios of the peak areas in the mass spectrum generated⁸².
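The peak-area ratio step can be illustrated with a minimal sketch. It assumes that the competitor and the cDNA amplify with equal efficiency (plausible because they differ by a single base) and that the spiked competitor amount is known; the function name `estimate_cdna_amount`, the units, and the numbers are hypothetical and are not taken from the MassARRAY software.

```python
def estimate_cdna_amount(competitor_amount: float,
                         peak_area_cdna: float,
                         peak_area_competitor: float) -> float:
    """Estimate the cDNA amount from the cDNA/competitor peak-area ratio.

    Assumes equal amplification and extension efficiency for both species,
    so the ratio of peak areas equals the ratio of starting amounts.
    """
    if peak_area_competitor <= 0:
        raise ValueError("competitor peak area must be positive")
    return competitor_amount * (peak_area_cdna / peak_area_competitor)


# Hypothetical values: 1000 copies of competitor spiked in, and the cDNA
# peak area is twice the competitor peak area.
print(estimate_cdna_amount(1000.0, 300.0, 150.0))  # 2000.0
```

In practice the competitor is often titrated over several concentrations so that the ratio is read near 1:1, where the estimate is least sensitive to measurement noise.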

Still other embodiments of PCR-based techniques which are known to the art and may be used for gene expression profiling include, e.g., differential display, amplified fragment length polymorphism (iAFLP), and BeadArray™ technology (Illumina, San Diego, Calif.) using the commercially available Luminex100 LabMAP system and multiple color-coded microspheres (Luminex Corp., Austin, Tex.) in a rapid assay for gene expression; and high coverage expression profiling (HiCEP) analysis.

As described in more detail in the examples below, the gene expression profiles for lung cancer classifications were collected as follows. RNA expression profiles are obtained by purification of PBMC from the blood of subjects by centrifugation using a CPT tube, a Ficoll gradient or equivalent density separation to remove red cells and granulocytes, and subsequent extraction of the RNA using TRIZOL tri-reagent, RNALATER reagent or a similar reagent to obtain RNA of high integrity. The amount of individual messenger RNA species was determined using microarrays and/or quantitative polymerase chain reaction.

After analysis of the RNA concentration, and RNA repair and/or amplification steps, the RNA is reverse transcribed using gene specific promoters, followed by RT-PCR. Finally, the data are analyzed to identify the characteristic gene expression pattern in the PBMC sample examined. The expression profiles characteristic of the disease to be diagnosed were compared and analyzed pairwise with an SVM algorithm (SVM-RCE)¹ (described in Examples 4 and 5) and with an alternative methodology described in Example 14 below. These methods can also be demonstrated using a similar machine-learning algorithm, such as SVM with Recursive Feature Elimination (SVM-RFE), or another classification algorithm such as Penalized Discriminant Analysis (PDA) (see International Patent Application Publication No. WO 2004/105573, published Dec. 9, 2004), to obtain a mathematical function whose coefficients act on the input RNA gene expression values and output a “SCORE” whose value determines the class of the individual and the confidence of the prediction. Once this function has been determined by analysis of numerous subjects known to be of the classes whose members are to be subsequently distinguished, it is used to classify subjects for their disease states.
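The way such a function acts on expression values can be sketched with a linear scoring rule. This is an assumption made for illustration only: the text does not state the form of the function, the gene names, coefficients and bias below are hypothetical placeholders rather than trained values, and the zero cutoff is likewise assumed.

```python
from typing import Mapping, Tuple


def score(expression: Mapping[str, float],
          weights: Mapping[str, float],
          bias: float = 0.0) -> float:
    """Weighted sum of expression values over the genes in the classifier."""
    missing = [gene for gene in weights if gene not in expression]
    if missing:
        raise KeyError(f"no expression value for: {missing}")
    return sum(weights[gene] * expression[gene] for gene in weights) + bias


def classify(score_value: float,
             positive: str = "NSCLC",
             negative: str = "control") -> Tuple[str, float]:
    """Sign of the score gives the class; its magnitude is a crude confidence."""
    label = positive if score_value > 0 else negative
    return label, abs(score_value)


if __name__ == "__main__":
    # Hypothetical coefficients and normalized expression values.
    weights = {"GENE_A": 0.5, "GENE_B": -0.25}
    sample = {"GENE_A": 2.0, "GENE_B": 4.0}
    print(classify(score(sample, weights, bias=0.5)))
```

In practice the coefficients would come from training on subjects of known class, for example with SVM-RCE or SVM-RFE as named above; the sketch only shows how a fixed function is applied to a new subject.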

In performing assays and methods of this invention, these same techniques are used, the patient's profile is compared with the appropriate reference profile, and a diagnosis or treatment recommendation is selected based on this information.

B. Microarrays

Differential gene expression can also be identified or confirmed using the microarray technique. Thus, the expression profile of lung cancer-associated genes can be measured in either fresh or paraffin-embedded tissue, using microarray technology. In this method, polynucleotide sequences of interest (including cDNAs and oligonucleotides) are plated, or arrayed, on a microchip substrate. The arrayed sequences are then hybridized with specific DNA probes from cells or tissues of interest. Just as in the RT-PCR, for the purposes of the methods and compositions herein, the source of mRNA is total RNA isolated from PBMC of controls and patient subjects.

In one embodiment of the microarray technique, PCR amplified inserts of cDNA clones are applied to a substrate in a dense array. Preferably at least 10,000 nucleotide sequences are applied to the substrate. The microarrayed genes, immobilized on the microchip at 10,000 elements each, are suitable for hybridization under stringent conditions. Fluorescently labeled cDNA probes may be generated through incorporation of fluorescent nucleotides by reverse transcription of RNA extracted from tissues of interest. Labeled cDNA probes applied to the chip hybridize with specificity to each spot of DNA on the array. After stringent washing to remove non-specifically bound probes, the chip is scanned by confocal laser microscopy or by another detection method, such as a CCD camera. Quantitation of hybridization of each arrayed element allows for assessment of corresponding mRNA abundance. With dual color fluorescence, separately labeled cDNA probes generated from two sources of RNA are hybridized pairwise to the array. The relative abundance of the transcripts from the two sources corresponding to each specified gene is thus determined simultaneously. The miniaturized scale of the hybridization affords a convenient and rapid evaluation of the expression pattern for large numbers of genes. Such methods have been shown to have the sensitivity required to detect rare transcripts, which are expressed at a few copies per cell, and to reproducibly detect at least approximately two-fold differences in the expression levels. Microarray analysis can be performed by commercially available equipment, following manufacturer's protocols.
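The dual color comparison described above is commonly summarized as a log2 ratio of the two channel intensities for each arrayed element. The sketch below illustrates that calculation and the approximately two-fold detection limit mentioned in the text; the function names and intensity values are hypothetical, and background correction and channel normalization are deliberately omitted.

```python
import math


def log2_ratio(signal_a: float, signal_b: float) -> float:
    """log2 of the intensity ratio between the two labeled RNA sources."""
    if signal_a <= 0 or signal_b <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(signal_a / signal_b)


def is_differential(signal_a: float, signal_b: float,
                    min_fold: float = 2.0) -> bool:
    """True when the two sources differ by at least `min_fold` in either direction."""
    return abs(log2_ratio(signal_a, signal_b)) >= math.log2(min_fold)


if __name__ == "__main__":
    # Hypothetical spot intensities for one gene in two samples.
    print(log2_ratio(400.0, 100.0), is_differential(400.0, 100.0))
```

Working on the log2 scale makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric around zero, which is why a single absolute-value cutoff covers both directions.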

Other useful methods, summarized in U.S. Pat. No. 7,081,340 and incorporated by reference herein, include Serial Analysis of Gene Expression (SAGE) and Massively Parallel Signature Sequencing (MPSS).

C. Immunohistochemistry

Immunohistochemistry methods are also suitable for detecting the expression levels of the gene expression products of the informative genes described for use in the methods and compositions herein. Antibodies or antisera, preferably polyclonal antisera, and most preferably monoclonal antibodies, or other protein-binding ligands specific for each marker are used to detect expression. The antibodies can be detected by direct labeling of the antibodies themselves, for example, with radioactive labels, fluorescent labels, hapten labels such as biotin, or an enzyme such as horseradish peroxidase or alkaline phosphatase. Alternatively, unlabeled primary antibody is used in conjunction with a labeled secondary antibody, comprising antisera, polyclonal antisera or a monoclonal antibody specific for the primary antibody. Protocols and kits for immunohistochemical analyses are well known in the art and are commercially available.

D. Proteomics

The term “proteome” is defined as the totality of the proteins present in a sample (e.g. tissue, organism, or cell culture) at a certain point of time. Proteomics includes, among other things, study of the global changes of protein expression in a sample (also referred to as “expression proteomics”). Proteomics typically includes the following steps: (1) separation of individual proteins in a sample by 2-D gel electrophoresis (2-D PAGE); (2) identification of the individual proteins recovered from the gel, e.g. by mass spectrometry or N-terminal sequencing; and (3) analysis of the data using bioinformatics. Proteomics methods are valuable supplements to other methods of gene expression profiling, and can be used, alone or in combination with other methods, to detect the gene expression products of the gene profiles described herein.

IV. COMPOSITIONS OF THE INVENTION

The methods for diagnosing lung cancer utilizing defined gene expression profiles permit the development of simplified diagnostic tools for diagnosing lung cancer, e.g., NSCLC, or diagnosing a specific stage (early, stage I, stage II or late) of lung cancer, diagnosing a specific type of lung cancer (e.g., AC vs. LSCC), or monitoring the effect of therapeutic or surgical intervention for determination of further treatment or evaluation of the likelihood of recurrence of the cancer.

Thus, a composition for diagnosing non-small cell lung cancer in a mammalian subject as described herein can be a kit or a reagent. For example, one embodiment of a composition includes a substrate upon which said polynucleotides or oligonucleotides or ligands are immobilized. In another embodiment, the composition is a kit containing the relevant three or more polynucleotides or oligonucleotides or ligands, optional detectable labels for same, immobilization substrates, optional substrates for enzymatic labels, as well as other laboratory items. In still another embodiment, at least one polynucleotide or oligonucleotide or ligand is associated with a detectable label.

Such a composition contains in one embodiment three or more polynucleotides or oligonucleotides, wherein each polynucleotide or oligonucleotide hybridizes to a different gene, gene fragment, gene transcript or expression product from mammalian peripheral blood mononuclear cells (PBMC), wherein said gene, gene fragment, gene transcript or expression product is selected from (i) the genes of Table I; (ii) the genes of Table II; (iii) the genes of Table III; and (iv) the genes of Table IV. In another embodiment, such a composition contains three or more polynucleotides or oligonucleotides, wherein each polynucleotide or oligonucleotide hybridizes to a different gene, gene fragment, gene transcript or expression product from mammalian peripheral blood mononuclear cells (PBMC), wherein said gene, gene fragment, gene transcript or expression product is selected from (i) the genes of Table V; (ii) the genes of Table VI; or (iii) the genes of Table VII.

In another embodiment, such a composition contains three or more ligands, wherein each ligand binds to a different gene expression product from mammalian peripheral blood mononuclear cells (PBMC), wherein the gene expression product is the product of a gene selected from (i) the genes of Table I; (ii) the genes of Table II; (iii) the genes of Table III; and (iv) the genes of Table IV. In still another embodiment, such a composition contains three or more ligands, wherein each ligand binds to a different gene expression product from mammalian peripheral blood mononuclear cells (PBMC), wherein the gene expression product is the product of a gene selected from (i) the genes of Table V; (ii) the genes of Table VI; or (iii) the genes of Table VII.

In one embodiment, a composition for diagnosing lung cancer in a mammalian subject includes three or more PCR primer-probe sets. Each primer-probe set amplifies a different polynucleotide sequence from a gene expression product of three or more informative genes found in the peripheral blood mononuclear cells (PBMC) of the subject. These informative genes are selected to form a gene expression profile or signature which is distinguishable between a subject having lung cancer and a selected reference control. Changes in expression in the genes in the gene expression profile from that of a reference gene expression profile are correlated with a lung cancer, such as non-small cell lung cancer (NSCLC).

In one embodiment of this composition, the informative genes are selected from among the genes identified in Table I below. Table I contains the approximately top 100 genes identified by the inventors as representative of a genomic signature indicative of the presence of any NSCLC lung cancer. This collection of genes is those for which the gene product expression is altered (i.e., increased or decreased) versus the same gene product expression in the PBMC of a reference control. In one embodiment, polynucleotides or oligonucleotides, such as PCR primers and probes, are generated to three or more informative genes from Table I for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first three genes in that Table. In another embodiment, PCR primers and probes are generated to at least six informative genes from Table I for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first six genes in that Table. In still another embodiment, PCR primers and probes are generated to at least fifteen informative genes from Table I for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first fifteen genes in that Table. Still other embodiments employ primers and probes to a targeted portion of other combinations of the genes in the Tables. The selected genes from the Table need not be in rank order; rather any combination that clearly shows a difference in expression between the reference control and the diseased patient is useful in such a composition.
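Selecting the first three or first six genes of Table I by rank can be sketched as follows. The symbols, scores and ranks are the first six entries of Table I; the data structure and the function name are hypothetical, and the check for at least three genes simply mirrors the "three or more" language of the compositions.

```python
from typing import List, Sequence, Tuple

# (symbol, score, rank) for the first six entries of Table I.
TABLE_I_TOP: List[Tuple[str, float, int]] = [
    ("TSC22D3", 0.9522, 1),
    ("CXCR4", 0.9444, 2),
    ("DNCL1", 0.8668, 3),
    ("RPS3", 0.8556, 4),
    ("DDIT4", 0.8502, 5),
    ("GZMB", 0.8148, 6),
]


def top_ranked(table: Sequence[Tuple[str, float, int]], n: int) -> List[str]:
    """Return the symbols of the n best-ranked genes (rank 1 first)."""
    if n < 3:
        raise ValueError("a panel uses three or more informative genes")
    if n > len(table):
        raise ValueError("not enough genes in the table")
    ordered = sorted(table, key=lambda row: row[2])
    return [symbol for symbol, _score, _rank in ordered[:n]]


if __name__ == "__main__":
    print(top_ranked(TABLE_I_TOP, 3))
```

Because the selected genes need not be in rank order, a panel could equally be assembled from any other combination of table entries; ranking is only one convenient way to choose.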

In one specific embodiment, the informative genes from Table I comprise three or more genes selected from the group consisting of IGSF6, HSPA8(A), LYN, DNCL1, HSPA1A, DPYSL2, HAGK, HSPA8(I), NFKBIA, FGL2, CALM2, CCL5, RPS2, DDIT4 and C1orf63.

In another embodiment of this composition, the informative genes are selected from among the genes identified in Table II below. Table II contains the approximately top 100 genes identified by the inventors as representative of a genomic signature indicative of the presence of a specific NSCLC, i.e., lung adenocarcinoma. This collection of genes is those for which the gene product expression is altered (i.e., increased or decreased) versus the same gene product expression in the PBMC of a reference control. In one embodiment, PCR primers and probes are generated to three or more informative genes from Table II for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first three genes in Table II. In another embodiment, PCR primers and probes are generated to at least six informative genes from Table II for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first six genes in Table II. In still another embodiment, PCR primers and probes are generated to at least fifteen informative genes from Table II for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first fifteen genes in Table II. Still other embodiments employ primers and probes to a targeted portion of other combinations of the genes in Table II. The selected genes from Table II need not be in rank order; rather any combination that clearly shows a difference in expression between the reference control and the diseased patient is useful in such a composition.

In one specific embodiment, the informative genes from Table II comprise three or more genes selected from the group consisting of ETS1, CCL5, DDIT4, CXCR4, DNCL1, MS4A6A, ATP5B, HSPA8(A), ADM, PTPN6, ARHGAP9, S100A8, DPYSL2, HSPA1A, and NFKBIA.

In another embodiment of this composition, the informative genes are selected from among the genes identified in Table III. Table III contains the top 100 genes identified by the inventors as representative of a genomic signature indicative of the effect of surgical resection of the tumor of a patient with an NSCLC. This collection of genes is those for which the gene product expression in the PBMC of a patient after surgery is altered (i.e., increased or decreased) versus the same gene product expression in the PBMC of the patient before surgery. In one embodiment, PCR primers and probes are generated to three or more informative genes from Table III for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first three genes in Table III. In another embodiment, PCR primers and probes are generated to at least six informative genes from Table III for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first six genes in Table III. In still another embodiment, PCR primers and probes are generated to at least fifteen informative genes from Table III for use in the composition. An example of such a composition contains primers and probes to a targeted portion of the first fifteen genes in Table III. Still other embodiments employ primers and probes to a targeted portion of other combinations of the genes in Table III. The selected genes from Table III need not be in rank order; rather any combination that clearly shows a difference in expression between a pre-surgery NSCLC patient and a post-surgery NSCLC patient is useful in such a composition.

In another embodiment of this composition, the informative genes are selected from among the genes identified in Table IV. Table IV contains embodiments of 15 genes useful as representative genomic signatures or profiles for three diagnostic uses, i.e., to distinguish between NSCLC and all controls, to distinguish between NSCLC in general and adenocarcinoma, and to distinguish between, and thus track progression of disease in, pre- and post-surgical subjects. In one embodiment, PCR primers and probes are generated to all 15 informative genes from Table IV, col. 1 for use in a diagnostic composition. In another embodiment, PCR primers and probes are generated to 15 informative genes from Table IV, col. 2 for use in a diagnostic composition. In still another embodiment, PCR primers and probes are generated to fifteen informative genes from Table IV, col. 3 for use in a diagnostic composition. Still other embodiments employ primers and probes to a targeted portion of other combinations of the genes in Table IV. The selected genes from Table IV need not be in rank order; rather any combination that clearly shows a difference between test subject and the compared groups is useful in such a composition.

In another embodiment of this composition, the informative genes are selected from among the genes identified in Table V. Table V contains embodiments of 136 genes useful as representative genomic signatures or profiles to distinguish between NSCLC and all controls, primarily non-healthy controls. In one embodiment, PCR primers and probes are generated to the top ranked 29 informative genes from Table V, thereby forming the 29-gene classifier of the examples below for use in a diagnostic composition. In still another embodiment, PCR primers and probes are generated to any desired number of informative genes from Table V for use in a diagnostic composition. The selected genes from Table V need not be in rank order; rather any combination that clearly shows a difference between test subject and the compared groups is useful in such a composition.

In another embodiment of this composition, the informative genes are selected from among the genes identified in Table VI. Table VI contains embodiments of 50 genes useful as representative genomic signatures or profiles to distinguish between presurgical and postsurgical subjects. In one embodiment, PCR primers and probes are generated to the top ranked 2 informative genes, e.g., CYP2R1 and MYO5B, from Table VI for use in a diagnostic composition. In still another embodiment, PCR primers and probes are generated to the top four genes, e.g., CYP2R1, MYO5B, DGUOK and DYNLL1, from Table VI for use in a diagnostic composition. In a further composition, oligonucleotides or polynucleotides, such as PCR primers and probes, that hybridize to or amplify any desired number of informative genes from Table VI are useful in a diagnostic composition. The selected genes from Table VI need not be in rank order; rather any combination that clearly shows a difference between test subject and the compared groups is useful in such a composition.

In another embodiment of this composition, the informative genes are selected from among the genes identified in Table VII. Table VII contains embodiments of 24 genes useful as representative genomic signatures or profiles to distinguish between NSCLC subjects and subjects with benign lung nodules. In one embodiment, oligonucleotides or polynucleotides, such as PCR primers and probes, are generated to all 24 informative genes from Table VII for use in a diagnostic composition. In still another embodiment, PCR primers and probes are generated to any small number of genes from Table VII for use in a diagnostic composition. The selected genes from Table VII need not be in rank order; rather any combination that clearly shows a difference between test subject and the compared groups is useful in such a composition.

In one embodiment of the compositions described above, the reference control is a non-healthy control (NHC) as described above. In other embodiments, the reference control may be any class of controls as described above in “Definitions”. A composition containing polynucleotides or oligonucleotides that hybridize to the members of the selected gene expression profile prepared from a selection of genes listed in these tables is desirable not only for diagnosis, but also for monitoring the effects of surgical or non-surgical therapeutic treatment to determine if the positive effects of resection/chemotherapy are maintained for a long period after initial treatment. These profiles also permit a determination of recurrence or the likelihood of recurrence of a lung cancer, e.g., NSCLC, if the results demonstrate a return to the pre-surgery/pre-chemotherapy profiles. It is further likely that these compositions may also be employed for use in monitoring the efficacy of non-surgical therapies for lung cancer.

The compositions based on the genes selected from Tables I through VII described herein, optionally associated with detectable labels, can be presented in the format of a microfluidics card, a chip or chamber, or a kit adapted for use with the PCR, RT-PCR or Q PCR techniques described above. In one aspect, such a format is a diagnostic assay using TAQMAN® Quantitative PCR low density arrays. Preliminary results suggest the number of genes required is compatible with these platforms. When a sample of PBMC from a selected patient subject is contacted with the primers and probes in the composition, PCR amplification of targeted informative genes in the gene expression profile from the patient permits detection of changes in expression in the genes in the gene expression profile from that of a reference gene expression profile. Significant changes in the gene expression of the informative genes in the patient's PBMC from that of the reference gene expression profile correlate with a diagnosis of lung cancer when using compositions directed to the genes of Table I or V, or of lung adenocarcinoma when using compositions directed to the genes of Table II. Similarly, when a sample of PBMC from a selected post-surgical patient subject is contacted with the primers and probes in the composition, PCR amplification of targeted informative genes selected from those of Table III or VI in the gene expression profile from the patient permits detection of changes in expression in the genes in the gene expression profile from that of a reference gene expression profile. In this circumstance a preferred reference profile is that obtained from the same patient (or a similar patient) prior to surgery. Significant changes in the gene expression of the informative genes in the patient's PBMC from that of the reference gene expression profile correlate with a positive effect of surgery, and/or maintenance of the positive effect.

Tables I through VII and the identifying information on the genes listed therein are described below.

TABLE I GENE NAME Symbol Score Rank TSC22 domain family, member 3(TSC22D3), transcript TSC22D3 0.9522 1 variant 2, mRNA. (A) chemokine(C-X-C motif) receptor 4 (CXCR4), CXCR4 0.9444 2 transcript variant 1,mRNA. (A) dynein, cytoplasmic, light polypeptide 1 (DNCL1), DNCL1 0.86683 mRNA. (S) ribosomal protein S3 (RPS3), mRNA. (S) RPS3 0.8556 4DNA-damage-inducible transcript 4 (DDIT4), mRNA. DDIT4 0.8502 5 (S)granzyme B (granzyme 2, cytotoxic T-lymphocyte- GZMB 0.8148 6 associatedserine esterase 1) (GZMB), mRNA. (S) B-cell translocation gene 1,anti-proliferative (BTG1), BTG1 0.8 7 mRNA. (S) heat shock 70 kDaprotein 8 (HSPA8), transcript variant HSPA8 0.793 8 1, mRNA. (I)ribosomal protein L12 (RPL12), mRNA. (S) RPL12 0.7564 9 Src-like-adaptor(SLA), mRNA. (S) SLA 0.7322 10 runt-related transcription factor 3(RUNX3), transcript RUNX3 0.7306 11 variant 2, mRNA. (I) HGFL gene(MGC17330), mRNA. (S) MGC17330 0.6982 12 heat shock 70 kDa protein 1A(HSPA1A), mRNA. (S) HSPA1A 0.684 13 interleukin 18 receptor accessoryprotein (IL18RAP), IL18RAP 0.6728 14 mRNA. (S) cold inducible RNAbinding protein (CIRBP), mRNA. CIRBP 0.67 15 (S) adrenomedullin (ADM),mRNA. (S) ADM 0.662 16 CCAAT/enhancer binding protein (C/EBP), betaCEBPB 0.654 17 (CEBPB), mRNA. (S) PREDICTED similar to heterogeneousnuclear LOC645385 0.654 18 ribonucleoprotein A1 (LOC645385), mRNA. (S)CCAAT/enhancer binding protein (C/EBP), delta CEBPD 0.6416 19 (CEBPD),mRNA. (S) Kruppel-like factor 9 (KLF9), mRNA. (S) KLF9 0.6392 20PREDICTED: hypothetical protein LOC440345, LOC440345 0.6358 21transcript variant 6 (LOC440345), mRNA. (I) inhibitor of DNA binding 2,dominant negative helix- ID2 0.617 22 loop-helix protein (ID2), mRNA.(S) killer cell Ig-like receptor, two domains, long KIR2DL3 0.6126 23cytoplasmic tail, 3 (KIR2DL3), transcript variant 2, mRNA(A)arachidonate 5-lipoxygenase-activating protein ALOX5AP 0.6106 24(ALOX5AP), mRNA. (S) immunoglobulin superfamily, member 6 (IGSF6), IGSF60.6068 25 mRNA. 
(S) heat shock 70 kDa protein 8 (HSPA8), transcriptvariant HSPA8 0.6032 27 2, mRNA. (A) Tubulin, alpha, ubiquitous(K-ALPHA-1), mRNA. (S) K- 0.6002 28 ALPHA-1 protein kinase C, delta(PRKCD), transcript variant 2, PRKCD 0.5992 29 mRNA. (A) PR domaincontaining 1, with ZNF domain (PRDM1), PRDM1 0.594 30 transcript variant1, mRNA. (A) CD55 antigen, decay accelerating factor for complement CD550.5722 31 (Cromer blood group) (CD55), mRNA. (S) cystatin F(leukocystatin) (CST7), mRNA. (S) CST7 0.5698 32 myeloid-associateddifferentiation marker (MYADM), MYADM 0.568 33 transcript variant 4,mRNA. (A) major histocompatibility complex, class I, F (HLA-F), HLA-F0.568 34 mRNA. (S) SH2 domain protein 2A (SH2D2A), mRNA. (S) SH2D2A0.5656 35 potassium channel tetramerisation domain containing 12 KCTD120.5638 36 (KCTD12), mRNA. (S) Ras-GTPase-activating proteinSH3-domain-binding G3BP 0.5636 37 protein (G3BP), transcript variant 1,mRNA. (A) fibrinogen-like 2 (FGL2), mRNA. (S) FGL2 0.5552 38CCAAT/enhancer binding protein (C/EBP), alpha CEBPA 0.5368 39 (CEBPA),mRNA. (S) DnaJ (Hsp40) homolog, subfamily A, member 1 DNAJA1 0.5306 40(DNAJA1), mRNA. (S) capping protein (actin filament) muscle Z-line,alpha 2 CAPZA2 0.5244 41 (CAPZA2), mRNA. (S) general transcriptionfactor MA (GTF3A), mRNA. (S) GTF3A 0.523 42 IBR domain containing 2(IBRDC2), mRNA. (S) IBRDC2 0.5228 43 interferon stimulated exonucleasegene 20 kDa (ISG20), ISG20 0.5208 44 mRNA. (S) PREDICTED similar toribosomal protein L13a, LOC649564 0.5134 45 transcript variant 4(LOC649564), mRNA. (A) G protein-coupled receptor 171 (GPR171), mRNA.(S) GPR171 0.5124 46 killer cell immunoglobulin-like receptor, twodomains, KIR2DL4 0.5044 47 long cytoplasmic tail, 4 (KIR2DL4), mRNA. (S)sin3-associated polypeptide, 30 kDa (SAP30), mRNA. SAP30 0.4972 48 (S)PREDICTED: meteorin, glial cell differentiation METRNL 0.4936 49regulator-like (METRNL), mRNA. (I) chloride intracellular channel 3(CLIC3), mRNA. 
(S) CLIC3 0.4926 50 eukaryotic translation initiationfactor 3, subunit 12 EIF3S12 0.4912 51 (EIF3S12), mRNA. (S) insulinreceptor substrate 2 (IRS2), mRNA. (S) IRS2 0.4824 52 hepatitis A viruscellular receptor 2 (HAVCR2), mRNA. HAVCR2 0.4758 53 (S) HD domaincontaining 2 (HDDC2), mRNA. (S) HDDC2 0.4754 54 nuclear RNA exportfactor 1 (NXF1), mRNA. (S) NXF1 0.468 55 perforin 1 (pore formingprotein) (PRF1), mRNA. (S) PRF1 0.4642 56 SAM domain, SH3 domain andnuclear localisation SAMSN1 0.4614 57 signals, 1 (SAMSN1), mRNA. (S)TERF1 (TRF1)-interacting nuclear factor 2 (TINF2), TINF2 0.4604 58 mRNA.(S) endoplasmic reticulum-golgi intermediate compartment ERGIC1 0.455459 (ERGIC) 1 (ERGIC1), transcript variant 1, mRNA. (I) tumor necrosisfactor, alpha-induced protein 2 TNFAIP2 0.455 60 (TNFAIP2), mRNA. (S)AT-hook transcription factor (AKNA), mRNA. (S) AKNA 0.4548 61 adiposedifferentiation-related protein (ADFP), mRNA. ADFP 0.4546 62 (S)pyruvate dehydrogenase kinase, isozyme 4 (PDK4), PDK4 0.4538 63 mRNA.(S) apoptotic peptidase activating factor (APAF1), transcript APAF10.4486 64 variant 5, mRNA. (A) signal transducer and activator oftranscription 4 STAT4 0.4478 65 (STAT4), mRNA. (S) aldo-keto reductasefamily 1, member C3 (3-alpha AKR1C3 0.4454 66 hydroxysteroiddehydrogenase, type II), mRNA. (S) SH2 domain containing 3C (SH2D3C),transcript variant SH2D3C 0.4444 67 2, mRNA. (I) heat shock 105 kDa/110kDa protein 1 (HSPH1), mRNA. HSPH1 0.4396 68 (S)phosphoinositide-3-kinase, regulatory subunit 1 (p85 PIK3R1 0.4312 69alpha) (PIK3R1), transcript variant 2, mRNA. (A) presenilin associated,rhomboid-like (PSARL), mRNA. PSARL 0.4284 70 (S) deoxyguanosine kinase,nuclear gene encoding DGUOK 0.4272 71 mitochondrial protein, transcriptvariant 1, mRNA. (A) pleckstrin homology, Sec7 and coiled-coil domains,PSCDBP 0.4206 72 binding protein (PSCDBP), mRNA. (S) uridinephosphorylase 1 (UPP1), transcript variant 2, UPP1 0.4188 73 mRNA. 
(A)solute carrier family 35 (CMP-sialic acid transporter), SLC35A1 0.417674 member A1 (SLC35A1), mRNA. (S) mitogen-activated protein kinasekinase kinase 8 MAP3K8 0.4162 75 (MAP3K8), mRNA. (S) chromosome 15 openreading frame 39 (C15orf39), C15orf39 0.411 76 mRNA. (S) ribosomalprotein L35 (RPL35), mRNA. (S) RPL35 0.4106 77 rho/rac guaninenucleotide exchange factor (GEF) 2 ARHGEF2 0.4074 78 (ARHGEF2), mRNA.(S) chromosome 19 open reading frame 37 (C19orf37), C19orf37 0.4072 79mRNA. (S) RNA binding motif protein 14 (RBM14), mRNA. (S) RBM14 0.406880 hypothetical protein MGC7036 (MGC7036), mRNA. (S) MGC7036 0.4056 81poly(A) polymerase alpha (PAPOLA), mRNA. (S) PAPOLA 0.4044 82 RAB10,member RAS oncogene family (RAB10), RAB10 0.403 83 mRNA. (S) chromosome2 open reading frame 28 (C2orf28), C2orf28 0.403 84 transcript variant2, mRNA. (A) LIM domain only 2 (rhombotin-like 1) (LMO2), mRNA. LMO20.3972 85 (S) polymerase (RNA) III (DNA directed) polypeptide G POLR3GL0.3968 86 (32 kD) like (POLR3GL), mRNA. (S) zinc finger and BTB domaincontaining 16 (ZBTB16), ZBTB16 0.3948 87 transcript variant 1, mRNA. (A)eukaryotic translation initiation factor 3, subunit 5 EIF3S5 0.3924 88epsilon, 47 kDa (EIF3S5), mRNA. (S) HSCARG protein (HSCARG), mRNA. (S)HSCARG 0.3916 89 synaptotagmin-like 3 (SYTL3), mRNA. (S) SYTL3 0.3896 90hypothetical protein FLJ32028 (FLJ32028), mRNA. (S) FLJ32028 0.3886 91leucine rich repeat containing 33 (LRRC33), mRNA. (S) LRRC33 0.3862 92chromosome 1 open reading frame 162 (C1orf162), C1orf162 0.3846 93 mRNA.(S) cytochrome P450, family 2, subfamily R, polypeptide 1 CYP2R1 0.384694 (CYP2R1), mRNA. (S) jun D proto-oncogene (JUND), mRNA. (S) JUND 0.38195 melanoma antigen family D, 1 (MAGED1), transcript MAGED1 0.3806 96variant 1, mRNA. (A) autism susceptibility candidate 2 (AUTS2), mRNA.(S) AUTS2 0.3806 97 oligodendrocyte transcription factor 1 (OLIG1),mRNA. 
OLIG1 0.379 98 (S) eukaryotic translation elongation factor 1delta (guanine EEF1D 0.3776 99 nucleotide exchange protein) (EEF1D),transcript variant 1, mRNA. (A) killer cell lectin-like receptorsubfamily K, member 1 KLRK1 0.3736 100 (KLRK1), mRNA. (S)

TABLE II
GENE NAME | Symbol | Score | Rank
v-ets erythroblastosis virus E26 oncogene homolog 1 (avian) (ETS1), mRNA. (S) | ETS1 | 0.9612 | 1
chemokine (C-C motif) ligand 5 (CCL5), mRNA. (S) | CCL5 | 0.9438 | 2
DNA-damage-inducible transcript 4 (DDIT4), mRNA. (S) | DDIT4 | 0.9024 | 3
chemokine (C-X-C motif) receptor 4 (CXCR4), transcript variant 1, mRNA. (A) | CXCR4 | 0.8098 | 4
dynein, cytoplasmic, light polypeptide 1 (DNCL1), mRNA. (S) | DNCL1 | 0.8058 | 5
membrane-spanning 4-domains, subfamily A, member 6A (MS4A6A), transcript variant 2, mRNA. (I) | MS4A6A | 0.796 | 6
ATP synthase, H+ transporting, mitochondrial F1 complex, beta polypeptide (ATP5B), nuclear gene encoding mitochondrial protein, mRNA. (S) | ATP5B | 0.7754 | 7
heat shock 70 kDa protein 8 (HSPA8), transcript variant 1, mRNA. (I) | HSPA8 | 0.7718 | 8
adrenomedullin (ADM), mRNA. (S) | ADM | 0.7708 | 9
protein tyrosine phosphatase, non-receptor type 6 (PTPN6), transcript variant 3, mRNA. (A) | PTPN6 | 0.7576 | 10
Rho GTPase activating protein 9 (ARHGAP9), mRNA. (S) | ARHGAP9 | 0.7548 | 11
S100 calcium binding protein A8 (calgranulin A) (S100A8), mRNA. (S) | S100A8 | 0.7336 | 12
dihydropyrimidinase-like 2 (DPYSL2), mRNA. (S) | DPYSL2 | 0.724 | 13
heat shock 70 kDa protein 1A (HSPA1A), mRNA. (S) | HSPA1A | 0.7156 | 14
nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, alpha (NFKBIA), mRNA. (S) | NFKBIA | 0.7132 | 15
N-acetylglucosamine kinase (NAGK), mRNA. (S) | NAGK | 0.7098 | 16
immunoglobulin superfamily, member 6 (IGSF6), mRNA. (S) | IGSF6 | 0.7088 | 17
major histocompatibility complex, class II, DM beta (HLA-DMB), mRNA. (S) | HLA-DMB | 0.704 | 18
family with sequence similarity 100, member B (FAM100B), mRNA. (S) | FAM100B | 0.7016 | 19
myosin, light polypeptide 6, alkali, smooth muscle and non-muscle, transcript variant 1, mRNA. (A) | MYL6 | 0.6962 | 20
solute carrier family 2 (facilitated glucose transporter), member 3 (SLC2A3), mRNA. (S) | SLC2A3 | 0.6738 | 21
heat shock 70 kDa protein 8 (HSPA8), transcript variant 2, mRNA. (A) | HSPA8 | 0.653 | 22
H2A histone family, member Z (H2AFZ), mRNA. (S) | H2AFZ | 0.6422 | 23
Kruppel-like factor 9 (KLF9), mRNA. (S) | KLF9 | 0.6354 | 24
tumor necrosis factor, alpha-induced protein 3 (TNFAIP3), mRNA. (S) | TNFAIP3 | 0.6312 | 25
selenoprotein W, 1 (SEPW1), mRNA. (S) | SEPW1 | 0.6164 | 26
sorting nexin 2 (SNX2), mRNA. (S) | SNX2 | 0.609 | 27
dual specificity phosphatase 1 (DUSP1), mRNA. (S) | DUSP1 | 0.6076 | 28
cystatin F (leukocystatin) (CST7), mRNA. (S) | CST7 | 0.5858 | 29
PREDICTED similar to 60S acidic ribosomal protein P1, transcript variant 4 (LOC440927), mRNA. (A) | LOC440927 | 0.5844 | 30
PR domain containing 1, with ZNF domain (PRDM1), transcript variant 1, mRNA. (A) | PRDM1 | 0.581 | 31
cold inducible RNA binding protein (CIRBP), mRNA. (S) | CIRBP | 0.5786 | 32
cat eye syndrome chromosome region, candidate 1 (CECR1), transcript variant 1, mRNA. (A) | CECR1 | 0.575 | 33
ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit 1, cardiac muscle (ATP5A1), nuclear gene encoding mitochondrial protein, transcript variant 1, mRNA. (A) | ATP5A1 | 0.5664 | 34
LIM domain only 2 (rhombotin-like 1) (LMO2), mRNA. (S) | LMO2 | 0.5608 | 35
ral guanine nucleotide dissociation stimulator (RALGDS), mRNA. (S) | RALGDS | 0.5572 | 36
G protein-coupled receptor 171 (GPR171), mRNA. (S) | GPR171 | 0.5536 | 37
RNA binding motif protein 5 (RBM5), mRNA. (S) | RBM5 | 0.5532 | 38
IL2-inducible T-cell kinase (ITK), mRNA. (S) | ITK | 0.545 | 39
CTD (carboxy-terminal domain, RNA polymerase II, polypeptide A) small phosphatase 2, mRNA. (S) | CTDSP2 | 0.542 | 40
general transcription factor IIIA (GTF3A), mRNA. (S) | GTF3A | 0.5394 | 41
myeloid-associated differentiation marker (MYADM), transcript variant 4, mRNA. (A) | MYADM | 0.5394 | 42
NACHT, leucine rich repeat and PYD (pyrin domain) containing 1, transcript variant 5, mRNA. (I) | NALP1 | 0.5384 | 43
DEAD (Asp-Glu-Ala-Asp) box polypeptide 17 (DDX17), transcript variant 2, mRNA. (A) | DDX17 | 0.5304 | 44
thrombospondin 1 (THBS1), mRNA. (S) | THBS1 | 0.5278 | 45
arachidonate 5-lipoxygenase (ALOX5), mRNA. (A) | ALOX5 | 0.523 | 46
sparc/osteonectin, cwcv and kazal-like domains proteoglycan (testican) 2 (SPOCK2), mRNA. (S) | SPOCK2 | 0.5186 | 47
hypothetical protein MGC7036 (MGC7036), mRNA. (S) | MGC7036 | 0.5182 | 48
phosphoinositide-3-kinase, regulatory subunit 1 (p85 alpha) (PIK3R1), transcript variant 2, mRNA. (A) | PIK3R1 | 0.5176 | 49
myeloid cell nuclear differentiation antigen (MNDA), mRNA. (S) | MNDA | 0.5158 | 50
solute carrier family 35 (CMP-sialic acid transporter), member A1 (SLC35A1), mRNA. (S) | SLC35A1 | 0.5142 | 51
chromosome 19 open reading frame 37 (C19orf37), mRNA. (S) | C19orf37 | 0.514 | 52
granzyme M (lymphocyte met-ase 1) (GZMM), mRNA. (S) | GZMM | 0.5066 | 53
transferrin receptor (p90, CD71) (TFRC), mRNA. (S) | TFRC | 0.5024 | 54
mixed lineage kinase domain-like (MLKL), mRNA. (I) | MLKL | 0.501 | 55
COMM domain containing 3 (COMMD3), mRNA. (S) | COMMD3 | 0.4976 | 56
RAB24, member RAS oncogene family (RAB24), transcript variant 2, mRNA. (A) | RAB24 | 0.497 | 57
PREDICTED similar to heterogeneous nuclear ribonucleoprotein A1 (LOC645385), mRNA. (S) | LOC645385 | 0.4966 | 58
RNA binding motif protein 14 (RBM14), mRNA. (S) | RBM14 | 0.4948 | 59
pleckstrin homology, Sec7 and coiled-coil domains 4 (PSCD4), mRNA. (S) | PSCD4 | 0.4928 | 60
zinc finger, DHHC-type containing 7 (ZDHHC7), mRNA. (S) | ZDHHC7 | 0.489 | 61
protein kinase C, eta (PRKCH), mRNA. (S) | PRKCH | 0.4886 | 62
hypothetical protein MGC11257 (MGC11257), mRNA. (S) | MGC11257 | 0.4854 | 63
heat shock 105 kDa/110 kDa protein 1 (HSPH1), mRNA. (S) | HSPH1 | 0.4812 | 64
retinoid X receptor, alpha (RXRA), mRNA. (S) | RXRA | 0.481 | 65
bicaudal D homolog 2 (Drosophila) (BICD2), transcript variant 1, mRNA. (A) | BICD2 | 0.4756 | 66
solute carrier family 27 (fatty acid transporter), member 3 (SLC27A3), mRNA. (S) | SLC27A3 | 0.47 | 67
CD96 antigen (CD96), transcript variant 1, mRNA. (A) | CD96 | 0.4688 | 68
ribosomal protein S2 (RPS2), mRNA. (S) | RPS2 | 0.4662 | 69
insulin receptor substrate 2 (IRS2), mRNA. (S) | IRS2 | 0.4654 | 70
protein tyrosine phosphatase, non-receptor type substrate 1 (PTPNS1), mRNA. (S) | PTPNS1 | 0.4612 | 71
ral guanine nucleotide dissociation stimulator-like 2 (RGL2), mRNA. (S) | RGL2 | 0.457 | 72
PREDICTED: similar to Translationally-controlled tumor protein (TCTP) (p23) (Histamine-releasing factor) (HRF) (Fortilin) (LOC643870), mRNA. (S) | LOC643870 | 0.4566 | 73
MID1 interacting protein 1 (gastrulation specific G12-like (zebrafish)) (MID1IP1), mRNA. (S) | MID1IP1 | 0.454 | 74
solute carrier family 7 (cationic amino acid transporter, y+ system), member 7 (SLC7A7), mRNA. (S) | SLC7A7 | 0.4502 | 75
FK506 binding protein 11, 19 kDa (FKBP11), mRNA. (S) | FKBP11 | 0.4492 | 76
SH2 domain containing 3C (SH2D3C), transcript variant 2, mRNA. (I) | SH2D3C | 0.4454 | 77
rho/rac guanine nucleotide exchange factor (GEF) 2 (ARHGEF2), mRNA. (S) | ARHGEF2 | 0.4444 | 78
nucleoporin 62 kDa (NUP62), transcript variant 1, mRNA. (A) | NUP62 | 0.4424 | 79
hypothetical protein FLJ20186 (FLJ20186), transcript variant 1, mRNA. (I) | FLJ20186 | 0.438 | 80
ATPase, H+ transporting, lysosomal 56/58 kDa, V1 subunit B, isoform 2 (ATP6V1B2), mRNA. (S) | ATP6V1B2 | 0.436 | 81
v-yes-1 Yamaguchi sarcoma viral related oncogene homolog (LYN), mRNA. (S) | LYN | 0.4358 | 82
tumor necrosis factor, alpha-induced protein 2 (TNFAIP2), mRNA. (S) | TNFAIP2 | 0.433 | 83
ST3 beta-galactoside alpha-2,3-sialyltransferase 1 (ST3GAL1), transcript variant 2, mRNA. (A) | ST3GAL1 | 0.4318 | 84
GABA(A) receptor-associated protein like 1 (GABARAPL1), mRNA. (S) | GABARAPL1 | 0.4276 | 85
DCP2 decapping enzyme homolog (S. cerevisiae) (DCP2), mRNA. (S) | DCP2 | 0.4272 | 86
family with sequence similarity 46, member A (FAM46A), mRNA. (S) | FAM46A | 0.4266 | 87
mitochondrial ribosomal protein L51 (MRPL51), nuclear gene encoding mitochondrial protein, mRNA. (S) | MRPL51 | 0.4256 | 89
chemokine (C-C motif) ligand 4-like 1 (CCL4L1), mRNA. (S) | CCL4L1 | 0.4208 | 90
deoxyguanosine kinase, nuclear gene encoding mitochondrial protein, transcript variant 1, mRNA. (A) | DGUOK | 0.4204 | 91
frequently rearranged in advanced T-cell lymphomas 2 (FRAT2), mRNA. (S) | FRAT2 | 0.4202 | 92
SH3-domain kinase binding protein 1 (SH3KBP1), transcript variant 1, mRNA. (I) | SH3KBP1 | 0.4172 | 93
dual specificity phosphatase 2 (DUSP2), mRNA. (S) | DUSP2 | 0.4172 | 94
eukaryotic translation initiation factor 2B, subunit 4 delta, 67 kDa, transcript variant 1, mRNA. (A) | EIF2B4 | 0.4136 | 95
fibrinogen-like 2 (FGL2), mRNA. (S) | FGL2 | 0.4126 | 96
glucosidase, alpha; neutral AB (GANAB), transcript variant 2, mRNA. (A) | GANAB | 0.4112 | 97
CCAAT/enhancer binding protein (C/EBP), alpha (CEBPA), mRNA. (S) | CEBPA | 0.41 | 98
prolylcarboxypeptidase (angiotensinase C) (PRCP), transcript variant 2, mRNA. (A) | PRCP | 0.4046 | 99
succinate-CoA ligase, GDP-forming, beta subunit (SUCLG2), mRNA. (S) | SUCLG2 | 0.4012 | 100

TABLE III
GENE NAME | Symbol | Score | Rank
TSC22 domain family, member 3 (TSC22D3), transcript variant 2, mRNA. (A) | TSC22D3 | 0.9522 | 1
chemokine (C-X-C motif) receptor 4 (CXCR4), transcript variant 1, mRNA. (A) | CXCR4 | 0.9444 | 2
dynein, cytoplasmic, light polypeptide 1 (DNCL1), mRNA. (S) | DNCL1 | 0.8668 | 3
ribosomal protein S3 (RPS3), mRNA. (S) | RPS3 | 0.8556 | 4
DNA-damage-inducible transcript 4 (DDIT4), mRNA. (S) | DDIT4 | 0.8502 | 5
granzyme B (granzyme 2, cytotoxic T-lymphocyte-associated serine esterase 1) (GZMB), mRNA. (S) | GZMB | 0.8148 | 6
B-cell translocation gene 1, anti-proliferative (BTG1), mRNA. (S) | BTG1 | 0.8 | 7
heat shock 70 kDa protein 8 (HSPA8), transcript variant 1, mRNA. (I) | HSPA8 | 0.793 | 8
ribosomal protein L12 (RPL12), mRNA. (S) | RPL12 | 0.7564 | 9
Src-like-adaptor (SLA), mRNA. (S) | SLA | 0.7322 | 10
runt-related transcription factor 3 (RUNX3), transcript variant 2, mRNA. (I) | RUNX3 | 0.7306 | 11
HGFL gene (MGC17330), mRNA. (S) | MGC17330 | 0.6982 | 12
heat shock 70 kDa protein 1A (HSPA1A), mRNA. (S) | HSPA1A | 0.684 | 13
interleukin 18 receptor accessory protein (IL18RAP), mRNA. (S) | IL18RAP | 0.6728 | 14
cold inducible RNA binding protein (CIRBP), mRNA. (S) | CIRBP | 0.67 | 15
adrenomedullin (ADM), mRNA. (S) | ADM | 0.662 | 16
CCAAT/enhancer binding protein (C/EBP), beta (CEBPB), mRNA. (S) | CEBPB | 0.654 | 17
PREDICTED similar to heterogeneous nuclear ribonucleoprotein A1 (LOC645385), mRNA. (S) | LOC645385 | 0.654 | 18
CCAAT/enhancer binding protein (C/EBP), delta (CEBPD), mRNA. (S) | CEBPD | 0.6416 | 19
Kruppel-like factor 9 (KLF9), mRNA. (S) | KLF9 | 0.6392 | 20
PREDICTED: hypothetical protein LOC440345, transcript variant 6 (LOC440345), mRNA. (I) | LOC440345 | 0.6358 | 21
inhibitor of DNA binding 2, dominant negative helix-loop-helix protein (ID2), mRNA. (S) | ID2 | 0.617 | 22
killer cell Ig-like receptor, two domains, long cytoplasmic tail 3, transcript variant 2, mRNA. (A) | KIR2DL3 | 0.6126 | 23
arachidonate 5-lipoxygenase-activating protein (ALOX5AP), mRNA. (S) | ALOX5AP | 0.6106 | 24
immunoglobulin superfamily, member 6 (IGSF6), mRNA. (S) | IGSF6 | 0.6068 | 25
heat shock 70 kDa protein 8 (HSPA8), transcript variant 2, mRNA. (A) | HSPA8 | 0.6032 | 27
tubulin, alpha, ubiquitous (K-ALPHA-1), mRNA. (S) | K-ALPHA-1 | 0.6002 | 28
protein kinase C, delta (PRKCD), transcript variant 2, mRNA. (A) | PRKCD | 0.5992 | 29
PR domain containing 1, with ZNF domain (PRDM1), transcript variant 1, mRNA. (A) | PRDM1 | 0.594 | 30
CD55 antigen, decay accelerating factor for complement (Cromer blood group) (CD55), mRNA. (S) | CD55 | 0.5722 | 31
cystatin F (leukocystatin) (CST7), mRNA. (S) | CST7 | 0.5698 | 32
myeloid-associated differentiation marker (MYADM), transcript variant 4, mRNA. (A) | MYADM | 0.568 | 33
major histocompatibility complex, class I, F (HLA-F), mRNA. (S) | HLA-F | 0.568 | 34
SH2 domain protein 2A (SH2D2A), mRNA. (S) | SH2D2A | 0.5656 | 35
potassium channel tetramerisation domain containing 12 (KCTD12), mRNA. (S) | KCTD12 | 0.5638 | 36
Ras-GTPase-activating protein SH3-domain-binding protein (G3BP), transcript variant 1, mRNA. (A) | G3BP | 0.5636 | 37
fibrinogen-like 2 (FGL2), mRNA. (S) | FGL2 | 0.5552 | 38
CCAAT/enhancer binding protein (C/EBP), alpha (CEBPA), mRNA. (S) | CEBPA | 0.5368 | 39
DnaJ (Hsp40) homolog, subfamily A, member 1 (DNAJA1), mRNA. (S) | DNAJA1 | 0.5306 | 40
capping protein (actin filament) muscle Z-line, alpha 2 (CAPZA2), mRNA. (S) | CAPZA2 | 0.5244 | 41
general transcription factor IIIA (GTF3A), mRNA. (S) | GTF3A | 0.523 | 42
IBR domain containing 2 (IBRDC2), mRNA. (S) | IBRDC2 | 0.5228 | 43
interferon stimulated exonuclease gene 20 kDa (ISG20), mRNA. (S) | ISG20 | 0.5208 | 44
PREDICTED similar to ribosomal protein L13a, transcript variant 4 (LOC649564), mRNA. (A) | LOC649564 | 0.5134 | 45
G protein-coupled receptor 171 (GPR171), mRNA. (S) | GPR171 | 0.5124 | 46
killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 4 (KIR2DL4), mRNA. (S) | KIR2DL4 | 0.5044 | 47
sin3-associated polypeptide, 30 kDa (SAP30), mRNA. (S) | SAP30 | 0.4972 | 48
PREDICTED: meteorin, glial cell differentiation regulator-like (METRNL), mRNA. (I) | METRNL | 0.4936 | 49
chloride intracellular channel 3 (CLIC3), mRNA. (S) | CLIC3 | 0.4926 | 50
eukaryotic translation initiation factor 3, subunit 12 (EIF3S12), mRNA. (S) | EIF3S12 | 0.4912 | 51
insulin receptor substrate 2 (IRS2), mRNA. (S) | IRS2 | 0.4824 | 52
hepatitis A virus cellular receptor 2 (HAVCR2), mRNA. (S) | HAVCR2 | 0.4758 | 53
HD domain containing 2 (HDDC2), mRNA. (S) | HDDC2 | 0.4754 | 54
nuclear RNA export factor 1 (NXF1), mRNA. (S) | NXF1 | 0.468 | 55
perforin 1 (pore forming protein) (PRF1), mRNA. (S) | PRF1 | 0.4642 | 56
SAM domain, SH3 domain and nuclear localisation signals, 1 (SAMSN1), mRNA. (S) | SAMSN1 | 0.4614 | 57
TERF1 (TRF1)-interacting nuclear factor 2 (TINF2), mRNA. (S) | TINF2 | 0.4604 | 58
endoplasmic reticulum-golgi intermediate compartment (ERGIC) 1, transcript variant 1, mRNA. (I) | ERGIC1 | 0.4554 | 59
tumor necrosis factor, alpha-induced protein 2 (TNFAIP2), mRNA. (S) | TNFAIP2 | 0.455 | 60
AT-hook transcription factor (AKNA), mRNA. (S) | AKNA | 0.4548 | 61
adipose differentiation-related protein (ADFP), mRNA. (S) | ADFP | 0.4546 | 62
pyruvate dehydrogenase kinase, isozyme 4 (PDK4), mRNA. (S) | PDK4 | 0.4538 | 63
apoptotic peptidase activating factor (APAF1), transcript variant 5, mRNA. (A) | APAF1 | 0.4486 | 64
signal transducer and activator of transcription 4 (STAT4), mRNA. (S) | STAT4 | 0.4478 | 65
aldo-keto reductase family 1, member C3 (3-alpha hydroxysteroid dehydrogenase, type II), mRNA. (S) | AKR1C3 | 0.4454 | 66
SH2 domain containing 3C (SH2D3C), transcript variant 2, mRNA. (I) | SH2D3C | 0.4444 | 67
heat shock 105 kDa/110 kDa protein 1 (HSPH1), mRNA. (S) | HSPH1 | 0.4396 | 68
phosphoinositide-3-kinase, regulatory subunit 1 (p85 alpha) (PIK3R1), transcript variant 2, mRNA. (A) | PIK3R1 | 0.4312 | 69
presenilin associated, rhomboid-like (PSARL), mRNA. (S) | PSARL | 0.4284 | 70
deoxyguanosine kinase, nuclear gene encoding mitochondrial protein, transcript variant 1, mRNA. (A) | DGUOK | 0.4272 | 71
pleckstrin homology, Sec7 and coiled-coil domains, binding protein (PSCDBP), mRNA. (S) | PSCDBP | 0.4206 | 72
uridine phosphorylase 1 (UPP1), transcript variant 2, mRNA. (A) | UPP1 | 0.4188 | 73
solute carrier family 35 (CMP-sialic acid transporter), member A1 (SLC35A1), mRNA. (S) | SLC35A1 | 0.4176 | 74
mitogen-activated protein kinase kinase kinase 8 (MAP3K8), mRNA. (S) | MAP3K8 | 0.4162 | 75
chromosome 15 open reading frame 39 (C15orf39), mRNA. (S) | C15orf39 | 0.411 | 76
ribosomal protein L35 (RPL35), mRNA. (S) | RPL35 | 0.4106 | 77
rho/rac guanine nucleotide exchange factor (GEF) 2 (ARHGEF2), mRNA. (S) | ARHGEF2 | 0.4074 | 78
chromosome 19 open reading frame 37 (C19orf37), mRNA. (S) | C19orf37 | 0.4072 | 79
RNA binding motif protein 14 (RBM14), mRNA. (S) | RBM14 | 0.4068 | 80
hypothetical protein MGC7036 (MGC7036), mRNA. (S) | MGC7036 | 0.4056 | 81
poly(A) polymerase alpha (PAPOLA), mRNA. (S) | PAPOLA | 0.4044 | 82
RAB10, member RAS oncogene family (RAB10), mRNA. (S) | RAB10 | 0.403 | 83
chromosome 2 open reading frame 28 (C2orf28), transcript variant 2, mRNA. (A) | C2orf28 | 0.403 | 84
LIM domain only 2 (rhombotin-like 1) (LMO2), mRNA. (S) | LMO2 | 0.3972 | 85
polymerase (RNA) III (DNA directed) polypeptide G (32 kD) like (POLR3GL), mRNA. (S) | POLR3GL | 0.3968 | 86
zinc finger and BTB domain containing 16 (ZBTB16), transcript variant 1, mRNA. (A) | ZBTB16 | 0.3948 | 87
eukaryotic translation initiation factor 3, subunit 5 epsilon, 47 kDa (EIF3S5), mRNA. (S) | EIF3S5 | 0.3924 | 88
HSCARG protein (HSCARG), mRNA. (S) | HSCARG | 0.3916 | 89
synaptotagmin-like 3 (SYTL3), mRNA. (S) | SYTL3 | 0.3896 | 90
hypothetical protein FLJ32028 (FLJ32028), mRNA. (S) | FLJ32028 | 0.3886 | 91
leucine rich repeat containing 33 (LRRC33), mRNA. (S) | LRRC33 | 0.3862 | 92
chromosome 1 open reading frame 162 (C1orf162), mRNA. (S) | C1orf162 | 0.3846 | 93
cytochrome P450, family 2, subfamily R, polypeptide 1 (CYP2R1), mRNA. (S) | CYP2R1 | 0.3846 | 94
jun D proto-oncogene (JUND), mRNA. (S) | JUND | 0.381 | 95
melanoma antigen family D, 1 (MAGED1), transcript variant 1, mRNA. (A) | MAGED1 | 0.3806 | 96
autism susceptibility candidate 2 (AUTS2), mRNA. (S) | AUTS2 | 0.3806 | 97
oligodendrocyte transcription factor 1 (OLIG1), mRNA. (S) | OLIG1 | 0.379 | 98
eukaryotic translation elongation factor 1 delta (guanine nucleotide exchange protein) (EEF1D), transcript variant 1, mRNA. (A) | EEF1D | 0.3776 | 99
killer cell lectin-like receptor subfamily K, member 1 (KLRK1), mRNA. (S) | KLRK1 | 0.3736 | 100

TABLE IV
Top 15 Gene Classifiers
Rank | ALL/NHC | AC/NHC | PRE/POST
1 | IGSF6 | ETS1 | TSC22D3
2 | HSPA8(A) | CCL5 | CXCR4
3 | LYN | DDIT4 | DNCL1
4 | DNCL1 | CXCR4 | RPS3
5 | HSPA1A | DNCL1 | DDIT4
6 | DPYSL2 | MS4A6A | GZMB
7 | NAGK | ATP5B | BTG1
8 | HSPA8(I) | HSPA8(A) | HSPA8(I)
9 | NFKBIA | ADM | RPL12
10 | FGL2 | PTPN6 | SLA
11 | CALM2 | ARHGAP9 | RUNX3
12 | CCL5 | S100A8 | MGC17330
13 | RPS2 | DPYSL2 | HSPA1A
14 | DDIT4 | HSPA1A | IL18RAP
15 | C1orf63 | NFKBIA | CIRBP

TABLE V
Spot ID | Accession No. | GENE NAME | Symbol | Rank | Fold Chg
5490167 | NM_016578 | hepatitis B virus x associated protein (HBXAP), mRNA. (S); or alternatively, called Remodeling and splicing factor 1 | HBXAP or RSF1 | 1 | 1.27
3890735 | NM_003583 | dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 2 (DYRK2), transcript variant 1, mRNA. (A) | DYRK2 | 2 | −1.34
3840377 | NM_003403 | YY1 transcription factor (YY1), mRNA. (S) | YY1 | 3 | −1.08
1470605 | NM_001031726 | chromosome 19 open reading frame 12, transcript variant 1, mRNA. (I) | C19orf12 | 4 | 1.36
4230709 | NM_018473 | thioesterase superfamily member 2 (THEM2), mRNA. (S) | THEM2 | 5 | −1.13
1430678 | NM_007118 | triple functional domain (PTPRF interacting) (TRIO), mRNA. (S) | TRIO | 6 | −1.16
1340086 | NM_001020820 | myeloid-associated differentiation marker, transcript variant 4, mRNA. (A) | MYADM | 7 | −1.34
2940370 | NM_017450 | BAI1-associated protein 2 (BAIAP2), transcript variant 1, mRNA. (I) | BAIAP2 | 8 | −1.34
6400075 | NM_024589 | leucine zipper domain protein (FLJ22386), mRNA. (S); or alternatively Rogdi homolog (Drosophila) | FLJ22386 or ROGDI | 9 | −1.18
20196 | NM_024920 | DnaJ (Hsp40) homolog, subfamily B, member 14 (DNAJB14), transcript variant 2, mRNA. (I) | DNAJB14 | 10 | −1.14
7330360 | NM_199191 | brain and reproductive organ-expressed (TNFRSF1A modulator) (BRE), transcript variant 3, mRNA. (A) | BRE | 11 | 1.04
240280 | NM_080652 | transmembrane protein 41A (TMEM41A), mRNA. (S) | TMEM41A | 12 | 1.15
3940687 | NM_032307 | chromosome 9 open reading frame 64 (C9orf64), mRNA. (S) | C9orf64 | 13 | −1.14
4150253 | NM_031424 | chromosome 20 open reading frame 55, transcript variant 1, mRNA. (A); or alternatively, Family with sequence similarity 110, member A | C20orf55 or FAM110A | 14 | −1.14
1660445 | NM_014801 | pecanex-like 2 (Drosophila) (PCNXL2), transcript variant 1, mRNA. (I) | PCNXL2 | 15 | 1.21
4120187 | NM_005612 | RE1-silencing transcription factor (REST), mRNA. (S) | REST | 16 | 1.29
7610494 | NM_014173 | HSPC142 protein (HSPC142), transcript variant 2, mRNA. (A); or alternatively, Chromosome 19 open reading frame 62 | HSPC142 or C19orf62 | 17 | 1.10
4250121 | NM_138779 | hypothetical protein BC015148 (LOC93081), mRNA. (S); or alternatively, Chromosome 13 open reading frame 27 | LOC93081 or C13orf27 | 18 | −1.18
4810674 | NM_022091 | activating signal cointegrator 1 complex subunit 3 (ASCC3), transcript variant 2, mRNA. (A) | ASCC3 | 19 | 1.83
3460224 | NM_005628 | solute carrier family 1 (neutral amino acid transporter), member 5 (SLC1A5), mRNA. (S) | SLC1A5 | 20 | −1.16
1110110 | NM_016395 | protein tyrosine phosphatase-like A domain containing 1, mRNA. (A) | PTPLAD1 | 21 | −1.22
2630397 | NM_005590 | MRE11 meiotic recombination 11 homolog A (S. cerevisiae) (MRE11A), transcript variant 2, mRNA. (A) | MRE11A | 22 | −1.18
1400541 | NM_033107 | hypothetical protein (DKFZP686A10121), mRNA. (S); or alternatively, GTP-binding protein 10 (putative), transcript variant 2 | DKFZP686A10121 or GTPBP10 | 23 | −1.27
4390100 | BX118737 | BX118737 Soares fetal liver spleen 1NFLS cDNA clone IMAGp998K18127, mRNA sequence (S) | NaN | 24 | −1.40
1500246 | NM_006217 | serpin peptidase inhibitor, clade I (pancpin), member 2 (SERPINI2), transcript variant 2, mRNA. (S) | SERPINI2 | 25 | −1.41
6590377 | AK126342 | cDNA FLJ44370 fis, clone TRACH3008902 (S); or alternatively, cAMP responsive element binding protein 1 | NaN or CREB1 | 26 | −1.45
3710754 | NM_016053 | coiled-coil domain containing 53 (CCDC53), mRNA. (S) | CCDC53 | 27 | −1.07
990112 | NM_032236 | ubiquitin specific peptidase 48 (USP48), transcript variant 1, mRNA. (I) | USP48 | 28 | −1.17
2640255 | NM_001007072 | zinc finger and SCAN domain containing 2, transcript variant 3, mRNA (I) | ZSCAN2 | 29 | 1.18
2370482 | NM_024754 | pentatricopeptide repeat domain 2 (PTCD2), mRNA. (S) | PTCD2 | 30 |
6380040 | NM_025201 | pleckstrin homology domain containing, family Q member 1 mRNA. (S) | PLEKHQ1 | 31 |
6370338 | AW191734 | HIMC10.07.00 human islet cDNA differential display cDNA, mRNA sequence (S) | NaN | 32 |
5340544 | NM_002616 | period homolog 1 (Drosophila) (PER1), mRNA. (S) | PER1 | 33 |
5910367 | NM_012154 | eukaryotic translation initiation factor 2C, 2 (EIF2C2), mRNA. (S) | EIF2C2 | 34 |
2570440 | NM_022128 | ribokinase (RBKS), mRNA. (S) | RBKS | 35 |
6100707 | NM_002419 | mitogen-activated protein kinase kinase kinase 11, mRNA. (S) | MAP3K11 | 36 |
2490615 | NM_207443 | FLJ45244 protein (FLJ45244), mRNA. (S) | FLJ45244 | 37 |
6580368 | NM_006611 | killer cell lectin-like receptor subfamily A, member 1, mRNA. (S) | KLRA1 | 38 |
4570553 | NM_016282 | adenylate kinase 3 (AK3), mRNA. (S) | AK3 | 39 |
5130500 | BG741535 | 602635144F1 NCI_CGAP_Skn3 cDNA clone IMAGE: 4780090 5, mRNA sequence (S) | NaN | 40 |
1240026 | NM_001003941 | oxoglutarate (alpha-ketoglutarate) dehydrogenase (lipoamide), nuclear gene encoding mitochondrial protein, transcript variant 2, mRNA. (I) | OGDH | 41 |
2680593 | NM_006582 | glucocorticoid modulatory element binding protein 1 (GMEB1), transcript variant 1, mRNA. (A) | GMEB1 | 42 |
130403 | NM_006567 | phenylalanine-tRNA synthetase 2 (mitochondrial) (FARS2), nuclear gene encoding mitochondrial protein, mRNA. (S) | FARS2 | 43 |
1710338 | NM_170768 | zinc finger protein 91 homolog (mouse), transcript variant 2, mRNA. (A) | ZFP91 | 44 |
150021 | NM_013285 | guanine nucleotide binding protein-like 2 (nucleolar) (GNL2), mRNA. (S) | GNL2 | 45 |
4250703 | XM_498909 | PREDICTED: hypothetical LOC440900 (LOC440900), mRNA. (S) | LOC440900 | 46 |
7000731 | NM_020453 | ATPase, Class V, type 10D (ATP10D), mRNA. (S) | ATP10D | 47 |
4590563 | XM_942240 | PREDICTED similar to HLA class II histocompatibility antigen, DQ (W1.1) beta chain precursor (DQB1*0501), transcript variant 1 (LOC650557), mRNA. (A) | LOC650557 | 48 |
3310446 | NM_018169 | chromosome 12 open reading frame 35 (C12orf35), mRNA. (S) | C12orf35 | 49 |
3460066 | XM_932088 | PREDICTED: hypothetical protein LOC642788, transcript variant 2 (LOC642788), mRNA. (A) | LOC642788 | 50 |
160152 | NM_003789 | TNFRSF1A-associated via death domain transcript variant 1, mRNA. (A) | TRADD | 51 |
840379 | NM_031212 | solute carrier family 25, member 28 (SLC25A28), mRNA. (S) | SLC25A28 | 52 |
4050402 | BX459101 | BX459101 PLACENTA cDNA clone CS0DE012YP17 5-PRIME, mRNA sequence (S) | NaN | 53 |
3440441 | AK124002 | cDNA FLJ42008 fis, clone SPLEN2031724 (S) | NaN | 54 |
5390504 | NM_001165 | baculoviral IAP repeat-containing 3, transcript variant 1, mRNA. (I) | BIRC3 | 55 |
5490564 | XM_940798 | PREDICTED similar to Bcl-2-associated transcription factor 1 (Btf), transcript variant 1 (LOC650759), mRNA. (I) | LOC650759 | 56 |
1940220 | XM_940538 | PREDICTED: protein tyrosine phosphatase-like A domain containing 1 (PTPLAD1), mRNA. (A) | PTPLAD1 | 57 |
770221 | NM_005950 | metallothionein 1G (MT1G), mRNA. (S) | MT1G | 58 |
1500647 | NM_005665 | ecotropic viral integration site 5 (EVI5), mRNA. (S) | EVI5 | 59 |
5900730 | NM_005813 | protein kinase D3 (PRKD3), mRNA. (S) | PRKD3 | 60 |
1980689 | NM_024029 | Yip1 domain family, member 2 (YIPF2), mRNA. (S) | YIPF2 | 61 |
770253 | NM_024076 | potassium channel tetramerisation domain containing 15, mRNA. (S) | KCTD15 | 62 |
2260484 | NM_022070 | amplified in breast cancer 1 (ABC1), mRNA. (S) | ABC1 | 63 |
380561 | NM_020773 | TBC1 domain family, member 14 (TBC1D14), mRNA. (S) | TBC1D14 | 64 |
780576 | NM_014238 | kinase suppressor of ras 1 (KSR1), mRNA. (S) | KSR1 | 65 |
240292 | BG564169 | 602590145F1 NIH_MGC_76 cDNA clone IMAGE: 4724074 5, mRNA sequence (S) | NaN | 66 |
6590021 | NM_024804 | zinc finger protein 669 (ZNF669), mRNA. (S) | ZNF669 | 67 |
6330471 | NM_004337 | chromosome 8 open reading frame 1 (C8orf1), mRNA. (S) | C8orf1 | 68 |
3170398 | NM_000747 | cholinergic receptor, nicotinic, beta 1 (muscle) (CHRNB1), mRNA. (S) | CHRNB1 | 69 |
3170477 | NM_001004489 | olfactory receptor, family 2, subfamily AG, member 1, mRNA. (S) | OR2AG1 | 70 |
2510563 | NM_024874 | KIAA0319-like (KIAA0319L), transcript variant 1, mRNA. (I) | KIAA0319L | 71 |
2510280 | NM_015106 | RAD54-like 2 (S. cerevisiae) (RAD54L2), mRNA. (S) | RAD54L2 | 72 |
4670685 | NM_003557 | phosphatidylinositol-4-phosphate 5-kinase, type I, alpha, mRNA. (S) | PIP5K1A | 73 |
4230736 | NM_001329 | C-terminal binding protein 2 (CTBP2), transcript variant 1, mRNA. (I) | CTBP2 | 74 |
7510164 | XM_938545 | PREDICTED: similar to Formin-binding protein 3 (Formin-binding protein 11) (FBP 11), transcript variant 1 (LOC648039), mRNA. (I) | LOC648039 | 75 |
4210576 | NM_022490 | polymerase (RNA) I associated factor 1 (PRAF1), mRNA. (S) | PRAF1 | 76 |
5910376 | NM_003246 | thrombospondin 1 (THBS1), mRNA. (S) | THBS1 | 77 |
2480202 | NM_006933 | solute carrier family 5 (inositol transporters), member 3, mRNA. (S) | SLC5A3 | 78 |
5960035 | NM_170699 | G protein-coupled bile acid receptor 1 (GPBAR1), mRNA. (S) | GPBAR1 | 79 |
5290192 | CR616845 | full-length cDNA clone CS0DF020YJ04 of Fetal brain of (human) (S) | NaN | 80 |
1170301 | NM_014572 | LATS, large tumor suppressor, homolog 2 (Drosophila), mRNA. (S) | LATS2 | 81 |
2340224 | NM_181724 | transmembrane protein 119 (TMEM119), mRNA. (S) | TMEM119 | 82 |
4210008 | NM_022168 | interferon induced with helicase C domain 1 (IFIH1), mRNA. (S) | IFIH1 | 83 |
3060563 | CD639673 | AGENCOURT_14534956 NIH_MGC_191 cDNA clone IMAGE: 30418908 5, mRNA sequence (S) | NaN | 84 |
7320600 | AK123531 | cDNA FLJ41537 fis, clone BRTHA2017985 (S) | NaN | 85 |
520097 | NM_003541 | histone 1, H4k (HIST1H4K), mRNA. (S) | HIST1H4K | 86 |
5270315 | NM_001240 | cyclin T1 (CCNT1), mRNA. (S) | CCNT1 | 87 |
2690008 | BC025734 | Homo sapiens, clone IMAGE: 5204729, mRNA (S) | NaN | 88 |
110044 | NM_001001795 | similar to RIKEN cDNA C030006K11 gene (MGC70857), mRNA. (S) | MGC70857 | 89 |
2030487 | BX118124 | BX118124 Soares_parathyroid_tumor_NbHPA cDNA clone IMAGp998P234189, mRNA sequence (S) | NaN | 90 |
1170139 | NM_033141 | mitogen-activated protein kinase kinase kinase 9 (MAP3K9), mRNA. (S) | MAP3K9 | 91 |
1190300 | NM_015353 | potassium channel tetramerisation domain containing 2, mRNA. (I) | KCTD2 | 92 |
4760543 | NM_153719 | nucleoporin 62 kDa (NUP62), transcript variant 1, mRNA. (A) | NUP62 | 93 |
7150564 | NM_003171 | suppressor of var1, 3-like 1 (S. cerevisiae) (SUPV3L1), mRNA. (S) | SUPV3L1 | 94 |
5820475 | NM_002690 | polymerase (DNA directed), beta (POLB), mRNA. (S) | POLB | 95 |
870563 | NM_014710 | G protein-coupled receptor associated sorting protein 1, mRNA. (S) | GPRASP1 | 96 |
4640202 | AW962976 | EST375049 MAGE resequences, MAGH cDNA, mRNA sequence (S) | NaN | 97 |
4250332 | XM_932676 | PREDICTED: similar to Gamma-glutamyltranspeptidase 1 precursor (Gamma-glutamyltransferase 1) (CD224 antigen), transcript variant 3 (LOC645367), mRNA. (I) | LOC645367 | 98 |
2570017 | NM_023034 | Wolf-Hirschhorn syndrome candidate 1-like 1 (WHSC1L1), transcript variant long, mRNA. (I) | WHSC1L1 | 99 |
3390458 | NM_002243 | potassium inwardly-rectifying channel, subfamily J, member 15 (KCNJ15), transcript variant 2, mRNA. (A) | KCNJ15 | 100 |
5360053 | XM_926644 | PREDICTED: similar to Thyroid hormone receptor-associated protein complex 240 kDa component (Trap240) (Thyroid hormone receptor associated protein 1) (Vitamin D3 receptor-interacting protein complex component DRIP250) (DRIP 250) (Activator-recruited cofactor. . . (LOC643298), mRNA. (S) | LOC643298 | 101 |
6760653 | XM_935750 | PREDICTED: similar to ETS domain protein Elk-1 (LOC641976), mRNA. (S) | LOC641976 | 102 |
3800615 | NM_080549 | protein tyrosine phosphatase, non-receptor type 6 (PTPN6), transcript variant 3, mRNA. (I) | PTPN6 | 103 |
5310452 | NM_153645 | nucleoporin 50 kDa (NUP50), transcript variant 3, mRNA. (A) | NUP50 | 104 |
3850288 | XM_934211 | PREDICTED: similar to Ribosome biogenesis protein BMS1 homolog, transcript variant 2 (LOC653471), mRNA. (I) | LOC653471 | 105 |
7560538 | NM_153209 | kinesin family member 19 (KIF19), mRNA. (S) | KIF19 | 106 |
6250338 | NM_152371 | chromosome 1 open reading frame 93 (C1orf93), mRNA. (S) | C1orf93 | 107 |
3360382 | NM_001625 | adenylate kinase 2 (AK2), transcript variant AK2A, mRNA. (A) | AK2 | 108 |
6960564 | NM_030934 | chromosome 1 open reading frame 25 (C1orf25), mRNA. (S) | C1orf25 | 109 |
1820131 | XM_945571 | PREDICTED: ankyrin repeat domain 13 family, member D, transcript variant 7 (ANKRD13D), mRNA. (I) | ANKRD13D | 110 |
3850255 | NM_001238 | cyclin E1 (CCNE1), transcript variant 1, mRNA. (A) | CCNE1 | 111 |
990523 | NM_006799 | protease, serine, 21 (testisin) (PRSS21), transcript variant 1, mRNA. (A) | PRSS21 | 112 |
4280577 | NM_006749 | solute carrier family 20 (phosphate transporter), member 2, mRNA. (S) | SLC20A2 | 113 |
7160368 | BC039681 | Homo sapiens, clone IMAGE: 5218705, mRNA (S) | NaN | 114 |
6020500 | NM_024923 | nucleoporin 210 kDa (NUP210), mRNA. (S) | NUP210 | 115 |
2360253 | NM_007041 | arginyltransferase 1 (ATE1), transcript variant 2, mRNA. (I) | ATE1 | 116 |
160372 | NM_006761 | tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, epsilon polypeptide (YWHAE), mRNA. (I) | YWHAE | 117 |
3370170 | BX093763 | BX093763 Soares_fetal_heart_NbHH19W cDNA clone IMAGp998N10870, mRNA sequence (S) | NaN | 118 |
60546 | AK057981 | cDNA FLJ25252 fis, clone STM03814 (S) | NaN | 119 |
1710411 | XM_374029 | PREDICTED: hypothetical LOC389089 (LOC389089), mRNA (S) | NaN | 120 |
6900315 | NM_017958 | pleckstrin homology domain containing, family B (evectins) member 2 (PLEKHB2), transcript variant 2, mRNA. (I) | PLEKHB2 | 121 |
1240603 | NM_000887 | integrin, alpha X (complement component 3 receptor 4 subunit), mRNA (I) | ITGAX | 122 |
60707 | NM_001119 | adducin 1 (alpha) (ADD1), transcript variant 1, mRNA. (A) | ADD1 | 123 |
7160707 | NM_198285 | hypothetical protein LOC349136 (LOC349136), mRNA. (S) | LOC349136 | 124 |
2970332 | NM_006328 | RNA binding motif protein 14 (RBM14), mRNA. (S) | RBM14 | 125 |
2760433 | NM_173564 | hypothetical protein FLJ37538 (FLJ37538), mRNA. (S) | FLJ37538 | 126 |
580041 | NM_001252 | tumor necrosis factor (ligand) superfamily, member 7, mRNA. (S) | TNFSF7 | 127 |
4120133 | NM_022827 | spermatogenesis associated 20 (SPATA20), mRNA. (S) | SPATA20 | 128 |
6560647 | NM_018696 | elaC homolog 1 (E. coli) (ELAC1), mRNA. (S) | ELAC1 | 129 |
4180195 | NM_001001520 | hepatoma-derived growth factor-related protein 2 (HDGF2), transcript variant 1, mRNA. (A) | HDGF2 | 130 |
6650020 | NM_001124 | adrenomedullin (ADM), mRNA. (S) | ADM | 131 |
2750364 | NM_020847 | trinucleotide repeat containing 6A, transcript variant 2, mRNA. (I) | TNRC6A | 132 |
1850682 | NM_015530 | golgi reassembly stacking protein 2, 55 kDa (GORASP2), mRNA. (S) | GORASP2 | 133 |
50414 | NM_006973 | zinc finger protein 32 (KOX 30) (ZNF32), transcript variant 1, mRNA. (A) | ZNF32 | 134 |
7200373 | NM_194310 | hypothetical protein LOC284837 (LOC284837), mRNA. (S) | LOC284837 | 135 |
3940215 | NM_015453 | THUMP domain containing 3 (THUMPD3), mRNA. (S) | THUMPD3 | 136 |

TABLE VI
Rank | Illumina SpotID | Acc No. | Name | Symbol | p-value | POST/PRE fold chg
1 | 3370291 | NM_024514 | Cytochrome P450, family 2, subfamily R, polypeptide 1 | CYP2R1 | 0.00000 | −1.39
2 | 6660437 | NM_006111 | Acetyl-Coenzyme A acyltransferase 2 | MYO5B | 0.00001 | −1.34
3 | 6380402 | NM_080915 | deoxyguanosine kinase (DGUOK), nuclear gene encoding mitochondrial protein, transcript variant 5, mRNA. (I) | DGUOK | 0.00000 | −1.82
4 | 1990500 | NM_003746 | Dynein, light chain, LC8-type 1 | DYNLL1 | 0.00002 | 1.38
5 | 150048 | NM_052873 | Chromosome 14 open reading frame 179 | C14orf179 | 0.00001 | −1.30
6 | 2230731 | NM_017745 | BCL6 co-repressor | BCOR | 0.00002 | 1.35
7 | 270070 | BF448693 | 7n93b04.x1 NCI_CGAP_Ov18 cDNA clone IMAGE: 3571927 3, mRNA sequence (S) | NaN | 0.00001 | −1.57
8 | 6560482 | NM_001280 | Cold inducible RNA binding protein | CIRBP | 0.00000 | −1.35
9 | 2970332 | NM_006328 | RNA binding motif protein 14 | RBM14 | 0.00006 | 1.25
10 | 3890682 | NM_003975 | SH2 domain protein 2A | SH2D2A | 0.00000 | −1.66
11 | 6560349 | NM_018425 | Phosphatidylinositol 4-kinase type 2 alpha | PI4K2A | 0.00005 | 1.37
12 | 1710411 | XM_374029 | PREDICTED: hypothetical LOC389089, mRNA (S) | NaN | 0.00007 | −1.44
13 | 1660019 | NM_001876 | Carnitine palmitoyltransferase 1A (liver) | CPT1A | 0.00003 | −1.33
14 | 2680161 | NM_006584 | Chaperonin containing TCP1, subunit 6B (zeta 2) | CCT6B | 0.00002 | −1.58
15 | 4060270 | BC009563 | Homo sapiens, clone IMAGE: 3901628, mRNA (S) | NaN | 0.00006 | −1.38
16 | 2650152 | NM_020698 | Transmembrane and coiled-coil domain family 3 | TMCC3 | 0.00014 | −1.86
17 | 20451 | NM_148976 | Proteasome (prosome, macropain) subunit, alpha type, 1 | PSMA1 | 0.00040 | −1.49
18 | 6220672 | NM_001031711 | Endoplasmic reticulum-golgi intermediate compartment 1 | ERGIC1 | 0.00055 | −1.35
19 | 6840017 | XM_941287 | PREDICTED: solute carrier family 25 (carnitine/acylcarnitine translocase), member 20 (SLC25A20), mRNA. (A) | SLC25A20 | 0.00159 | −1.24
20 | 870709 | NM_006133 | Diacylglycerol lipase, alpha | DAGLA | 0.00086 | 1.40
21 | 5860148 | NM_007320 | RAN binding protein 3 | RANBP3 | 0.00179 | −1.38
22 | 20707 | NM_207584 | Interferon (alpha, beta and omega) receptor 2 | IFNAR2 | 0.00025 | −1.25
23 | 5900156 | NM_006082 | Tubulin, alpha 1b | TUBA1B | 0.00268 | 1.13
24 | 6480170 | NM_001005333 | Melanoma antigen family D, 1 | MAGED1 | 0.00001 | −1.27
25 | 4010605 | NM_001008739 | Similar to RIKEN cDNA 2310039H08 | LOC441150 | 0.00007 | −1.24
26 | 7210192 | NM_003123 | Sialophorin (leukosialin, CD43) | SPN | 0.00014 | −1.89
27 | 4260148 | XM_371534 | PREDICTED: similar to CG10806-PB, isoform B, mRNA. (A) | LOC389000 | 0.00075 | −1.33
28 | 6560020 | NM_017651 | Abelson helper integration site 1 | AHI1 | 0.00379 | −1.33
29 | 6480661 | NM_002255 | Killer cell Ig-like receptor, two domains, long cytoplasmic tail, 4 | KIR2DL4 | 0.00117 | −2.02
30 | 650753 | NM_006712 | Fas-activated serine/threonine kinase | FASTK | 0.00003 | −1.40
31 | 1230528 | NM_006644 | Heat shock 105 kDa/110 kDa protein 1 | HSPH1 | 0.00006 | 1.47
32 | 6420086 | NM_001539 | DnaJ (Hsp40) homolog, subfamily A, member 1 | DNAJA1 | 0.00009 | 1.26
33 | 4120092 | NM_018244 | Ubiquinol-cytochrome c reductase complex chaperone, CBP3 homolog (yeast) | UQCC | 0.00286 | −1.40
34 | 4250438 | NM_145267 | Chromosome 6 open reading frame 57 | C6orf57 | 0.00188 | −1.15
35 | 5860477 | NM_005226 | Sphingosine-1-phosphate receptor 3 | S1PR3 | 0.00017 | 1.69
36 | 5910037 | NM_182757 | Ring finger 144B | RNF144B | 0.00000 | −1.97
37 | 6020707 | NM_003416 | Zinc finger protein 7 | ZNF7 | 0.00023 | −1.14
38 | 4260497 | NM_018179 | Activating transcription factor 7 interacting protein | ATF7IP | 0.00092 | 1.40
39 | 2760068 | NM_005489 | SH2 domain containing 3C | SH2D3C | 0.00007 | 1.34
40 | 6250056 | NM_152832 | Family with sequence similarity 89, member B | FAM89B | 0.00043 | 1.21
41 | 6040273 | BX115698 | BX115698 Soares_testis_NHT cDNA clone IMAGp998M211829, mRNA sequence (S) | NaN | 0.00031 | −1.37
42 | 1990100 | XM_930024 | PREDICTED: hypothetical protein LOC132241, transcript variant 2 (LOC132241), mRNA. (A) | LOC132241 | 0.00005 | −1.21
43 | 2640066 | NM_001008910 | Serine/threonine kinase 16 | STK16 | 0.00000 | −1.90
44 | 770605 | NM_145271 | Zinc finger protein 688 | ZNF688 | 0.00000 | −1.58
45 | 7200356 | NM_001008541 | MAX interactor 1 | MXI1 | 0.00192 | 1.55
46 | 1690709 | NM_024815 | Nudix (nucleoside diphosphate linked moiety X)-type motif 18 | NUDT18 | 0.00167 | −1.20
47 | 1300743 | NM_004089 | TSC22 domain family, member 3 | TSC22D3 | 0.00003 | −1.40
48 | 2100201 | NM_015558 | synovial sarcoma translocation gene on chromosome 18-like 1 (SS18L1), transcript variant 2, mRNA. (A) | SS18L1 | 0.00008 | 1.19
49 | 1820209 | NM_001659 | ADP-ribosylation factor 3 | ARF3 | 0.00090 | 1.19
50 | 1780762 | NM_032847 | Chromosome 8 open reading frame 76 | C8orf76 | 0.00037 | −1.15

TABLE VII
# | Rank | ID | Acc No | Description | Gene Name/Symbol | p-value | Fold Chg
1 | | 4880431 | NM_181738 | Peroxiredoxin 2 | PRDX2 | 0.00000 | 1.42
2 | 16 | 4120187 | NM_005612 | RE1-silencing transcription factor | REST | 0.00034 | 1.40
3 | | 4590563 | XM_942240 | PREDICTED: similar to HLA class II histocompatibility antigen, DQ(W1.1) beta chain precursor (DQB1*0501), transcript variant 1 | LOC650557 | 0.00042 | −2.41
4 | | 7210129 | NM_178025 | gamma-glutamyltransferase-like 3 (GGTL3), transcript variant 2 | GGTL3 | 0.00018 | 1.35
5 | 19 | 4810674 | NM_022091 | Activating signal cointegrator 1 complex subunit 3 | ASCC3 | 0.00274 | 1.73
6 | | 4280722 | NM_005481 | Mediator complex subunit 16 | MED16 | 0.00027 | 1.22
7 | 23 | 1400541 | NM_033107 | GTP-binding protein 10 (putative) | GTPBP10 | 0.00559 | −1.26
8 | | 1190022 | NM_176895 | Phosphatidic acid phosphatase type 2A | PPAP2A | 0.00355 | 1.20
9 | | 3060692 | NM_001010935 | RAP1A, member of RAS oncogene family | RAP1A | 0.00018 | −1.35
10 | | 2570440 | NM_022128 | Brain and reproductive organ-expressed (TNFRSF1A modulator) | BRE | 0.00282 | 1.10
11 | | 4060138 | XM_941904 | PREDICTED: similar to Transcriptional regulator ATRX (X-linked helicase II) (X-linked nuclear protein) (XNP) | LOC652455 | 0.00029 | 1.47
12 | | 6180296 | NM_001017969 | KIAA2026 | KIAA2026 | 0.00028 | −1.15
13 | | 1430292 | NM_000578 | Solute carrier family 11 (proton-coupled divalent metal ion transporters), member 1 | SLC11A1 | 0.00006 | −1.41
14 | | 110112 | NM_005701 | Snurportin 1 | SNUPN | 0.00033 | −1.17
15 | | 6330471 | NM_004337 | Oxidative stress induced growth inhibitor family member 2 | OSGIN2 | 0.00204 | −1.09
16 | | 5050019 | XM_945607 | PREDICTED: spastic paraplegia 21 (autosomal recessive, Mast syndrome), transcript variant 3 (SPG21), mRNA | SPG21 | 0.02419 | 1.13
17 | 4 | 1470605 | NM_001031726 | Chromosome 19 open reading frame 12 | C19orf12 | 0.00382 | 1.43
18 | | 6620224 | NM_001024662 | Ribosomal protein L6 | RPL6 | 0.00350 | −1.03
19 | | 4250133 | NM_005188 | Cas-Br-M (murine) ecotropic retroviral transforming seq. | CBL | 0.00001 | −1.18
20 | 9 | 6400075 | NM_024589 | Rogdi homolog (Drosophila) | ROGDI | 0.00023 | −1.32
21 | | 6580491 | NM_001015880 | 3′-phosphoadenosine 5′-phosphosulfate synthase 2 | PAPSS2 | 0.00341 | −1.34
22 | 8 | 2940370 | NM_017450 | BAI1-associated protein 2 | BAIAP2 | 0.00046 | −1.38
23 | | 3360026 | NM_017911 | Family with sequence similarity 118, member A | FAM118A | 0.01598 | 1.94
24 | 6 | 1430678 | NM_007118 | Triple functional domain (PTPRF interacting) | TRIO | 0.00001 | −1.31

For use in the above-noted compositions, the PCR primers and probes are preferably designed based upon intron sequences present in the gene(s) to be amplified selected from the gene expression profile. The design of the primer and probe sequences is within the skill of the art once the particular gene target is selected. The particular methods selected for the primer and probe design and the particular primer and probe sequences are not limiting features of these compositions. A ready explanation of primer and probe design techniques available to those of skill in the art is summarized in U.S. Pat. No. 7,081,340, with reference to publicly available tools such as DNA BLAST software, the Repeat Masker program (Baylor College of Medicine), Primer Express (Applied Biosystems); MGB assay-by-design (Applied Biosystems); Primer3 (Steve Rozen and Helen J. Skaletsky (2000) Primer3 on the WWW for general users and for biologist programmers⁸⁵) and other publications^(86,87,88).

In general, optimal PCR primers and probes used in the compositions described herein are 17-30 bases in length, and contain about 20-80%, such as, for example, about 50-60% G+C bases. Melting temperatures of between 50 and 80° C., e.g., about 50 to 70° C., are typically preferred.
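By way of non-limiting illustration only, the length, G+C content and melting temperature ranges stated above can be checked with a short script. The sketch below is not part of the methods described herein: the function names are hypothetical, and the melting temperature is estimated with the Wallace rule (2° C. per A or T, 4° C. per G or C), which is assumed here as a rough approximation intended for short oligonucleotides. Dedicated tools such as Primer3 use more refined models.

```python
def gc_percent(seq):
    """Percentage of G and C bases in an oligonucleotide sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Crude melting temperature estimate in deg C (Wallace rule, an assumption)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def primer_in_range(seq, length=(17, 30), gc=(20.0, 80.0), tm=(50.0, 80.0)):
    """True if the oligo falls inside the general ranges given in the text."""
    return (length[0] <= len(seq) <= length[1]
            and gc[0] <= gc_percent(seq) <= gc[1]
            and tm[0] <= wallace_tm(seq) <= tm[1])


# Hypothetical 20-mer, used only to exercise the checks.
example = "ATGCATGCATGCATGCATGC"
print(len(example), gc_percent(example), wallace_tm(example), primer_in_range(example))
```

The default ranges mirror the broad ranges recited above; the narrower preferred ranges (about 50-60% G+C, about 50 to 70° C.) can be passed as arguments instead.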

In another aspect, a composition for diagnosing lung cancer in a mammalian subject contains a plurality of polynucleotides immobilized on a substrate, wherein the plurality of genomic probes hybridize to three or more gene expression products of three or more informative genes selected from a gene expression profile in the peripheral blood mononuclear cells (PBMC) of the subject, the gene expression profile comprising genes selected from Table I through Table VII. This type of composition relies on recognition of the same gene profiles as described above for the PCR compositions but employs the techniques of a cDNA array. Hybridization of the immobilized polynucleotides in the composition to the gene expression products present in the PBMC of the patient subject is employed to quantitate the expression of the informative genes selected from among the genes identified in Tables I through VII to generate a gene expression profile for the patient, which is then compared to that of a reference sample. As described above, depending upon the identification of the profile (i.e., that of genes of Table I, II, III, IV, V, VI or VII or subsets thereof), this composition enables the diagnosis and prognosis of NSCLC lung cancers. Again, the selection of the polynucleotide sequences, their length and labels used in the composition are routine determinations made by one of skill in the art in view of the teachings of which genes can form the gene expression profiles suitable for the diagnosis and prognosis of lung cancers.

The composition, which can be presented in the format of a microfluidics card, a microarray, a chip or chamber, employs the polynucleotide hybridization techniques described herein. When a sample of PBMC from a selected patient subject is contacted with the hybridization probes in the composition, PCR amplification of targeted informative genes in the gene expression profile from the patient permits detection and quantification of changes in expression in the genes in the gene expression profile from that of a reference gene expression profile. Significant changes in the gene expression of the informative genes in the patient's PBMC from that of the reference gene expression profile correlate with a diagnosis of non-small cell lung cancer (NSCLC).

In yet another aspect, a composition or kit useful in the methods described herein contains a plurality of ligands that bind to three or more gene expression products of three or more informative genes selected from a gene expression profile in the peripheral blood mononuclear cells (PBMC) of the subject. The gene expression profile contains the genes of any of Tables I through VII, as described above for the other compositions. This composition enables detection of the proteins expressed by the genes in the indicated Tables. While preferably the ligands are antibodies to the proteins encoded by the genes in the profile, it would be evident to one of skill in the art that various forms of antibody, e.g., polyclonal, monoclonal, recombinant, chimeric, as well as fragments and components (e.g., CDRs, single chain variable regions, etc.) may be used in place of antibodies. Such ligands may be immobilized on suitable substrates for contact with the subject's PBMC and analyzed in a conventional fashion. In certain embodiments, the ligands are associated with detectable labels. These compositions also enable detection of changes in proteins encoded by the genes in the gene expression profile from those of a reference gene expression profile. Such changes correlate with lung cancer, e.g., NSCLC, or diagnosis of cancer stage or type, or pre/post surgical status and prognosis in a manner similar to that for the PCR and polynucleotide-containing compositions described above.

In yet a further aspect, a useful composition can contain a plurality of gene expression products of three or more informative genes selected from the gene expression profile in the peripheral blood mononuclear cells (PBMC) of the subject immobilized on a substrate for detection or quantification of antibodies to the proteins encoded by the genes of the profiles in the PBMC of a subject. The gene expression profiles include genes selected from any of Tables I through VII, or subsets thereof, such as the 29 gene classifier of Table V (genes ranked 1-29). This type of composition, directed at detecting antibodies to the products of the genes, is also useful in identifying and quantitatively detecting changes in expression in the genes in the gene expression profile from that of a reference gene expression profile for the same reasons identified above for the PCR/polynucleotide-containing compositions. As with the other compositions, this type of composition correlates the expression levels of the proteins encoded by the informative genes in the patient's PBMCs with those of a reference control. Significant changes are indicative of a diagnosis of a lung cancer, are useful for monitoring surgical/therapeutic intervention in the disease, and/or for providing a prognosis of same.

For all of the above forms of diagnostic/prognostic compositions, the gene expression profile can, in one embodiment, include at least the first 5 of the informative genes of any of Tables I through VII or subsets thereof. In another embodiment for all of the above forms of diagnostic/prognostic compositions, the gene expression profile can include 10 or more of the informative genes of any of Tables I through VII or subsets thereof. In another embodiment for all of the above forms of diagnostic/prognostic compositions, the gene expression profile can include 15 or more of the informative genes of any of Tables I through VII or subsets thereof. In another embodiment for all of the above forms of diagnostic/prognostic compositions, the gene expression profile can include 24 or more of the informative genes of any of Tables I through III, and V-VII or subsets thereof. In another embodiment for all of the above forms of diagnostic/prognostic compositions, the gene expression profile can include 30 to 50 or more of the informative genes of any of Tables I-III, V and VII or subsets thereof.

These compositions may be used to diagnose lung cancers, such as stage I or stage II NSCLC. Further, these compositions are useful to provide a supplemental or original diagnosis in a subject having lung nodules of unknown etiology. The gene expression profiles formed by genes selected from any of Tables I-VII or subsets thereof are distinguishable from an inflammatory gene expression profile. Further, various embodiments of these compositions can utilize reference gene expression profiles including three or more informative genes of any of Tables I-VII or subsets thereof from the PBMC of one or a combination of classes of reference human subjects. Classes of the reference subjects can include a smoker with malignant disease, a smoker with non-malignant disease, a former smoker with non-malignant disease, a healthy non-smoker with no disease, a non-smoker who has chronic obstructive pulmonary disease (COPD), a former smoker with COPD, a subject with a solid lung tumor prior to surgery for removal of same; a subject with a solid lung tumor following surgical removal of the tumor; a subject with a solid lung tumor prior to therapy for same; and a subject with a solid lung tumor during or following therapy for same. Selection of the appropriate class depends upon the use of the composition, i.e., for original diagnosis, for prognosis following therapy or surgery or for specific diagnosis of disease type, e.g., AC vs. LSCC.

IV. DIAGNOSTIC METHODS OF THE INVENTION

All of the above-described compositions provide a variety of diagnostic tools which permit a blood-based, non-invasive assessment of disease status in a subject. Use of these compositions in diagnostic tests, which may be coupled with other screening tests, such as a chest X-ray or CT scan, increases diagnostic accuracy and/or directs additional testing. In other aspects, the diagnostic compositions and tools described herein permit the prognosis of disease, monitoring response to specific therapies, and regular assessment of the risk of recurrence. The methods and use of the compositions described herein also permit the evaluation of changes in diagnostic signatures present in pre-surgery and post-therapy samples and identify a gene expression profile or signature that reflects tumor presence and may be used to assess the probability of recurrence. The results on pre-post surgery lung cancer identified in the examples below support a similar detectable effect of the tumor on gene expression in patient PBMCs.

Thus, in one aspect, a method is provided for diagnosing lung cancer in a mammalian subject. This method involves identifying a gene expression profile in the peripheral blood mononuclear cells (PBMC) of a mammalian, preferably human, subject. The gene expression profile includes three or more gene expression products of three or more informative genes having increased or decreased expression in lung cancer. The gene expression profiles are formed by selection of three or more informative genes from the genes of any of Tables I-VII or subsets thereof. Comparison of a subject's gene expression profile with a reference gene expression profile permits identification of changes in expression of the informative genes that correlate with a lung cancer (e.g., NSCLC). This method may be performed using any of the compositions described above.

In one embodiment, the method enables the diagnosis of adenocarcinoma specifically. For this purpose, the gene expression profile is desirably selected from the genes of Table II. In another embodiment, the method enables the diagnosis of stage I or II NSCLC. For this purpose, the gene expression profile is desirably formed of three or more genes of Table I or Table V, including the 29 gene classifier.

As described above for the compositions, the gene profiles optionally involve 5, 6, 10, 15, 25, and greater than 30 informative genes from the respective tables, and can utilize any of the diagnostic method formats referred to herein.

As yet another aspect, a method is provided for predicting the likelihood of recurrence of lung cancer in a mammalian subject. This method includes identifying a gene expression profile in the peripheral blood mononuclear cells (PBMC) of the subject after solid tumor resection or chemotherapy. For this purpose, the gene expression profile includes three or more gene expression products of three or more informative genes of Table III or Table VI. In another embodiment, the gene expression products include the top ranked 2 or 4 genes of Table VI. In one embodiment, the gene expression products are those of the top six genes of Table III or VI. In another embodiment, the gene expression products include at least 10 or 15 of the top ranked genes of Table III or VI. Still other combinations of the genes of Table III or VI are useful in forming a gene expression profile for this purpose. The subject's post-surgical or post-therapeutic gene expression profile is compared with said subject's pre-surgical or pre-therapeutic gene expression profile. Significant changes in expression of said informative genes correlate with a decreased likelihood of recurrence. Maintenance of the changed gene profile expression over time is indicative of low recurrence post-surgery or post-therapy. As indicated in the examples below, this change is identifiable in the PBMC of a subject that has a background of smoking and/or has chronic obstructive pulmonary disease (COPD). As stated above, this method may be performed using the diagnostic compositions and general methodologies described elsewhere in this specification.

The diagnostic compositions and methods described herein provide a variety of advantages over current diagnostic methods. Among such advantages are the following. As exemplified herein, subjects with adenocarcinoma or squamous cell carcinoma of the lung, the two most common types of lung cancer, are distinguished from subjects with non-malignant lung diseases including chronic obstructive pulmonary disease (COPD) or granuloma or other benign tumors. These methods and compositions provide a solution to the practical diagnostic problem of whether a patient who presents at a lung clinic with a small nodule has malignant disease. Patients with an intermediate-risk nodule would clearly benefit from a non-invasive test that would move the patient into either a very low-likelihood or a very high-likelihood category of disease risk. An accurate estimate of malignancy based on a genomic profile (i.e., estimating that a given patient has a 90% probability of having cancer versus estimating that the patient has only a 5% chance of having cancer) would result in fewer surgeries for benign disease, more early stage tumors removed at a curable stage, fewer follow-up CT scans, and reduction of the significant psychological costs of worrying about a nodule. The economic impact would also likely be significant, such as reducing the current estimated cost of additional health care associated with CT screening for lung cancer, i.e., $116,000 per quality adjusted life-year gained. A non-invasive PBMC genomics test that has sufficient sensitivity and specificity would significantly alter the post-test probability of malignancy and thus, the subsequent clinical care.
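For illustration only, the effect of a test's sensitivity and specificity on the post-test probability of malignancy can be worked through with the standard likelihood-ratio form of Bayes' rule. The sketch below is an assumption-laden example, not a result reported herein: the function name is hypothetical, and the pre-test probability, sensitivity and specificity shown are arbitrary placeholder values.

```python
def post_test_probability(pre_test, sensitivity, specificity, positive=True):
    """Post-test probability of disease from a pre-test probability.

    Uses odds x likelihood ratio. All inputs are fractions in (0, 1);
    specificity must be below 1 for a positive result, above 0 for a negative one.
    """
    if positive:
        likelihood_ratio = sensitivity / (1.0 - specificity)
    else:
        likelihood_ratio = (1.0 - sensitivity) / specificity
    pre_odds = pre_test / (1.0 - pre_test)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Placeholder values: an intermediate-risk nodule (50% pre-test probability)
# and a hypothetical test with 90% sensitivity and 90% specificity.
print(post_test_probability(0.5, 0.9, 0.9, positive=True))
print(post_test_probability(0.5, 0.9, 0.9, positive=False))
```

With these placeholder inputs a positive result moves the patient toward the high-likelihood category and a negative result toward the low-likelihood category, which is the kind of reclassification contemplated above.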

A desirable advantage of these methods over existing methods is that they are able to characterize the disease state from a minimally-invasive procedure, i.e., by taking a blood sample. In contrast, current practice for classification of cancer tumors from gene expression profiles depends on a tissue sample, usually a sample from a tumor. In the case of very small tumors a biopsy is problematic, and clearly if no tumor is known or visible, a sample from it is impossible. No purification of tumor is required, as is the case when tumor samples are analyzed. A recently published method depends on brushing epithelial cells from the lung during bronchoscopy, a method which is also considerably more invasive than taking a blood sample, and applicable only to lung cancers, while the methods described herein are generalizable to any cancer. Blood samples have an additional advantage, which is that the material is easily prepared and stabilized for later analysis, which is important when messenger RNA is to be analyzed.

One embodiment of the methods described herein is the use of new algorithms for analyzing the gene expression profiles, which are superior for classification to existing algorithms, especially in the analysis of noisy or low signal/noise data. When comparing a generalized disease to a generalized non-disease state, the data is likely to be noisy because many different subclasses are being combined in the comparison. This method could be used as an adjunct to existing diagnosis of lung disease at any pulmonary clinic.

V. EXAMPLES

The invention is now described with reference to the following examples. These examples are provided for the purpose of illustration only and the invention should in no way be construed as being limited to these examples but rather should be construed to encompass any and all variations that become evident as a result of the teaching provided herein.

Example 1 Patient Subject and Control Subjects for PBMC Samples

PBMC samples and clinical information were collected from 300 lung cancer patients and 150 controls, including samples from 16 patients collected pre- and post-surgery. Patient subjects and control subjects both have the key risk factor for lung cancer, i.e., smoking, and many of the patient subjects and non-healthy controls (NHCs) have smoking-related diseases such as COPD. The major difference between the 2 classes is the presence of a malignant nodule in the patient class.

A. Patient Subjects

Patient populations useful in providing data for the development of the gene expression profiles described herein include newly diagnosed male and female patients with early stage lung cancer. The patients selected included a representative number of African-American patients (about 15%), Hispanics (5%), and no Pacific Islanders. The age range of the patients was from 50-80 years. They were in moderately good health (ambulatory), although with medical illness. They were excluded if they had had previous cancers, chemotherapy, radiation, or cancer surgery. They must have had a lung cancer diagnosis within the preceding 6 months, histologic confirmation, and no systemic therapy, such as chemotherapy, radiation therapy or cancer surgery, as biomarker levels may change with therapy. Thus the majority of the cancer patients were early stage (i.e., Stage I and Stage II).

Another group of patients was those cancer patients in which blood was obtained before surgery and then again at a reasonable interval post-surgery (˜2-6 months) to ensure that any acute surgical/inflammatory changes had resolved. This allows each patient to serve as his “own control”. Inclusion criteria were patients with a diagnosis of Stage I or II lung cancer that is surgically resectable. They were excluded if they had had previous cancers, chemotherapy, radiation, or cancer surgery. Data were collected on 16 pairs of pre vs. post surgery samples that were analyzed on the Illumina platform. These studies show a loss of tumor signature post surgery in 13 of the 16 pairs tested, supporting the detection of a tumor-induced signature in the peripheral blood samples monitored.

B. Control Subjects

Rather than using matched healthy controls (non-smokers or “healthy” smokers), the control cohort was derived primarily from matched at-risk pulmonary patients (smokers and ex-smokers) with non-malignant lung disease and patients with benign lung nodules (e.g., granulomas or hamartomas). The control group is referred to here as “non-healthy controls” (NHC). These patients were evaluated at pulmonary clinics, or underwent thoracic surgery for a lung nodule. All samples were collected prior to surgery. Inclusion criteria for controls were patients between 50-80 years old, with a tobacco use of >10 pack years, and a chest X-ray or CT scan within the last six months demonstrating no evidence of lung cancer and no other cancer within the preceding 5 years. Control subjects were matched to the patient subjects based on age, race, gender, and smoking status. Thus, the majority of controls were smokers or ex-smokers greater than 50 years of age. Another control group included patients undergoing surgery for lung nodules in which the nodule turned out to be benign. The NHCs are a population that would benefit significantly from regular monitoring due to their increased risk for developing lung cancer.

Example 2 Sample Collection Protocols and Processing

Blood samples were collected in the clinic by the tissue acquisition technician. Blood was collected in two CPT® tubes (Becton-Dickinson). CPT tubes are evacuated blood collection tubes containing FICOLL reagent below a gel insert and an anti-coagulant above the gel. This is a very efficient and easy way to directly isolate PBMC. Blood was collected from the same patients during their 2-6 month follow-up visit in the clinic after surgery. Blood samples were collected in PAXgene tubes from a subset of patients and control subjects. All coded samples, including tissue blocks and blood components (PBMC, serum, and plasma), were stored based on subject identification in marked freezer storage boxes at −80° C.

Collected samples were processed through a variety of routine steps that have been highly standardized. Samples were processed as batches (usually 20-50 samples) of both cases and controls rather than individually as samples were collected. At every step, they were randomized so that no particular class of patients or controls was processed as a separate group. RNA purification was carried out using TRI-REAGENT (Molecular Research) as recommended. DNA and RNA were extracted from each sample and DNA was archived for future studies. RNA samples were controlled for quality using the Bioanalyzer and only samples with 28S/18S ratios >0.75 were used for further studies. Samples with lower ratios were archived as they were still suitable for PCR validation studies. The same amount (250 ng of total RNA) was amplified (aRNA) using the RNA amplification kit (Ambion). This provided sufficient amplified material (5-10 μg) for multiple repeats of the arrays and for PCR validation studies. All samples were amplified only once.

An alternative sample collection scheme employs the PAXgene Blood RNA System (Preanalytix, a Qiagen/BD company) for stabilizing RNA in whole blood samples. As PAXgene requires no special processing of the blood samples, it permits more ready development of standards for sample collection. To optimize consistent collection of samples at multiple sites of a clinical trial, the PAXgene Blood RNA System integrates the key steps of whole blood collection, nucleic acid stabilization, and RNA purification. It uses standardized BD Vacutainer™ technology with tubes containing a proprietary reagent that immediately stabilizes intracellular RNA for days at room temperature or weeks at 4° C.; the tubes can be stored for at least a year at −80° C. before purification of the RNA. The PAXgene tubes may be shipped overnight and stored at −80° C. until use. All tubes remain at room temperature for 2-4 hrs before freezing as this enhances RNA yields. The ability to minimize processing urgency greatly enhances lab efficiency. For more details see http://www.preanalytix.com/RNA.asp.

In many ways, this was the best method for immediately preserving the RNA message populations present at the time of collection. However, the large amount of globin message present in these samples interfered with message determination on microarrays, despite the efforts to surmount this problem. If a PCR assay is employed for the gene expression profiles described herein, the use of PAXgene is preferred, as the globin message does not interfere with PCR assays.

Example 3 Methods of Processing Data for Gene Expression Profiling

The ILLUMINA BeadChip is a relatively new method of performing multiplex gene analysis. The essential element of BeadChip technology is the attachment of oligonucleotides to silica beads. The beads are then randomly deposited into wells on a substrate (for example, a glass slide). The resultant array is decoded to determine which oligonucleotide-bead combination is in which well. The decoded arrays may be used for a number of applications, including gene expression analysis. These arrays have the same gene coverage as Affymetrix arrays (47,000 probes for 27,000 genes including splice variants) but use 50-mer oligonucleotides rather than 25-mers and thus provide greater specificity.

The data analysis pipeline procedures using Matlab functions, coded PDA and SVM with RFE and SVM-RCE, were routinely and successfully used as evidenced by previous publications and as described herein.

A. Data Pre-Processing and Array Quality Control.

Data were processed as described generally¹ and expression levels for signal and control probes were exported. A set of negative control probes was used to calculate the average background level and to determine the signal detection threshold. The probe expression data were normalized using quantile normalization. The data were checked for outliers by calculating an outlier score for each of the samples. First, Spearman correlation coefficients were calculated for every sample pair. Median correlation for each sample (Ms), median correlation for all sample pairs (Mp) and median absolute deviation from Mp (MADp) were calculated. The outlier score (similar to a Z-score) for sample i was then calculated as (Msi−Mp)/MADp. Outlier scores were studied to pick a threshold to mark potential outliers. Usually, the samples with outlier scores of more than 5 were considered technical outliers. Further identification of outliers was done through multivariate statistics such as principal components analysis (PCA) plots, multidimensional scaling, and robust PCA.
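The outlier score described above can be sketched as follows. This is an illustrative NumPy sketch, not the Matlab code actually used: the function names are hypothetical, Spearman correlation is computed by ranking each sample's probes without tie correction (assumed adequate for continuous intensities), and applying the threshold of 5 to the magnitude of the score is an assumption, since a poorly correlated sample yields a negative score under the stated formula.

```python
import numpy as np


def spearman_matrix(data):
    """Sample-by-sample Spearman correlations for a probes x samples matrix."""
    # Rank probes within each sample (no tie correction; an assumption).
    ranks = np.argsort(np.argsort(data, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


def outlier_scores(data):
    """Return (Ms_i - Mp) / MADp for every sample i."""
    corr = spearman_matrix(data)
    n = corr.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Ms: median correlation of each sample with all other samples.
    ms = np.array([np.median(corr[i, not_self[i]]) for i in range(n)])
    # Mp and MADp: computed over all distinct sample pairs.
    pairs = corr[np.triu_indices(n, k=1)]
    mp = np.median(pairs)
    madp = np.median(np.abs(pairs - mp))
    return (ms - mp) / madp


# Synthetic demonstration: eight concordant samples and one unrelated sample.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
data = signal[:, None] + 0.1 * rng.normal(size=(500, 9))
data[:, 8] = rng.normal(size=500)
scores = outlier_scores(data)
flagged = np.flatnonzero(np.abs(scores) > 5)  # magnitude threshold is an assumption
print(flagged)
```

The synthetic data are for demonstration only and do not represent any sample described herein.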

In order to reduce the experimental noise, the data were filtered by removing non-informative probes, i.e., probes that were not detected in the majority of samples (more than 95%) or probes that did not change at least 1.2 fold between at least two samples. If a sample had replicates, the latest replicate was taken for the analysis.
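The probe filter just described may be sketched as below. The sketch assumes a probes x samples matrix of positive, linear-scale intensities and a matching boolean matrix of detection calls (for example, derived from the negative-control threshold); both inputs and the function name are hypothetical. A probe changes at least 1.2 fold between at least two samples exactly when its maximum-to-minimum ratio is at least 1.2.

```python
import numpy as np


def informative_probe_mask(expr, detected, max_undetected=0.95, min_fold=1.2):
    """Boolean mask of probes to keep.

    expr     : probes x samples, positive linear-scale intensities (assumed)
    detected : probes x samples, True where the probe was called detected
    """
    # Drop probes undetected in more than 95% of samples.
    undetected_fraction = 1.0 - detected.mean(axis=1)
    keep_detected = undetected_fraction <= max_undetected
    # Largest fold change between any two samples is max / min.
    fold = expr.max(axis=1) / expr.min(axis=1)
    keep_variable = fold >= min_fold
    return keep_detected & keep_variable


# Toy data: probe 0 barely varies, probe 1 passes, probe 2 is never detected.
expr = np.array([[100.0, 110.0, 105.0],
                 [100.0, 150.0, 90.0],
                 [100.0, 200.0, 50.0]])
detected = np.array([[True, True, True],
                     [True, True, True],
                     [False, False, False]])
mask = informative_probe_mask(expr, detected)
print(mask)
```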

B. Unsupervised Classification.

Where appropriate, hierarchical clustering was applied using either Euclidean distance or correlation, and multidimensional scaling was used to inspect datasets for evidence of outliers or subclasses. VISDA (53) was utilized for this purpose with good success.

C. Supervised Classification.

Support Vector Machine (SVM) can be applied to gene expression datasets for gene function discovery and classification. SVM has been found to be most efficient at distinguishing the more closely related cases and controls that reside in the margins. Primarily SVM-RFE (48, 54) was used to develop gene expression classifiers which distinguish clinically defined classes of patients from clinically defined classes of controls (smokers, non-smokers, COPD, granuloma, etc.). SVM-RFE is an SVM-based model utilized in the art that recursively removes genes based on their contribution to the discrimination between the two classes being analyzed. The lowest scoring genes by coefficient weights were removed, the remaining genes were scored again, and the procedure was repeated until only a few genes remained. This method has been used in several studies to perform classification and gene selection tasks. However, the choice of values for the algorithm parameters (penalty parameter, kernel function, etc.) can often influence performance.
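The recursive elimination loop described above can be illustrated with a minimal NumPy sketch. It is not the implementation used herein: the linear SVM is fitted by plain batch subgradient descent on a regularized hinge loss, one feature is removed per round, and all function names, parameter values and the synthetic data are assumptions made only for illustration.

```python
import numpy as np


def train_linear_svm(X, y, lam=1.0, lr=0.1, epochs=200):
    """Linear SVM via batch subgradient descent; labels y must be -1 or +1."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1          # samples inside or beyond the margin
        grad_w = lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, n_keep=2):
    """Repeatedly drop the feature with the smallest squared weight."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        w, _ = train_linear_svm(X[:, remaining], y)
        worst = int(np.argmin(w ** 2))
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Synthetic data: features 0 and 1 differ between classes, the rest are noise.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 100)
X = rng.normal(size=(200, 10))
X[:, 0] += 2.0 * y
X[:, 1] += 2.0 * y
kept, dropped = svm_rfe(X, y, n_keep=2)
print(sorted(kept))
```

In practice an established SVM library would normally replace the hand-written optimizer, and several low-scoring genes may be removed per round when the starting gene list is large.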

SVM-RCE is a related SVM based model, in that it, like SVM-RFE, assesses the relative contributions of the genes to the classifier. SVM-RCE assesses the contributions of groups of correlated genes instead of individual genes. Additionally, although both methods remove the least important genes at each step, SVM-RCE scores and removes clusters of genes, while SVM-RFE scores and removes a single or small numbers of genes at each round of the algorithm.

The SVM-RCE method is briefly described here. Low expressing genes (average expression less than 2× background) were removed, quantile normalization performed, and then “outlier” arrays whose median expression values differ by more than 3 sigma from the median of the dataset were removed. The remaining samples were subject to SVM-RCE using ten repetitions of 10-fold cross-validation of the algorithm. The genes were reduced by t-test (applied on the training set) to an experimentally determined optimal value which produces highest accuracy in the final result. These starting genes were clustered by K-means into clusters of correlated genes whose average size is 3-5 genes. SVM classification scoring was carried out on each cluster using 3-fold resampling repeated 5 times, and the worst scoring clusters eliminated. Accuracy is determined on the surviving pool of genes using the left-out 10% of samples (testing set) and the top-scoring 100 genes were recorded. The procedure was repeated from the clustering step to an end point of 2 clusters. The optimal gene panel was taken to be the minimal number of genes which gives the maximal accuracy starting with the most frequently selected gene. The identity of the individual genes in this panel is not fixed, since the order reflects the number of times a given gene was selected in the top 100 informative genes and this order is subject to some variation.

Using SVM-RCE, the initial assessment of the performance of each individual gene cluster, as a separate feature, allowed for the identification of those clusters that contributed the least to the classification. These were removed from the analysis while those clusters which exhibited relatively better classification performance were retained. Re-clustering of genes after each elimination step was permitted to allow the formation of new, potentially more informative clusters. The most informative gene clusters were retained for additional rounds of assessment until the clusters of genes with the best classification accuracy were identified.

Utilization of the method using gene clusters, rather than individual genes, enhanced the supervised classification accuracy of the same data as compared to the accuracy when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) were used to remove genes based on their individual discriminant weights. The method also permitted the arbitrary determination of the number of clusters and cluster size at the onset of the analysis by the investigator and, as the algorithm proceeded, the least informative clusters were progressively removed. The method further provided the top n clusters required to most accurately differentiate the two pre-defined classes. These two methods are further defined in the following examples.

D. Biomarker Selection.

Genes which scored highest (by SVM) in discriminating patients from controls were examined for their utility for clinical tests. Factors considered included higher differences in expression levels between classes and low variability within classes. When selecting biomarkers for validation, an effort was made to select genes with distinct expression profiles to avoid selection of correlated genes (55) and to identify genes with differential expression levels that were robust by alternative techniques including PCR and/or immunohistochemistry.

E. Validation.

Three methods of validation were considered.

Cross-Validation: To minimize over-fitting within a dataset, K-fold cross-validation (K usually equal to 10) was used, in which the dataset is split randomly into K parts and K−1 parts are used for training and 1 for testing. Thus, for K=10 the algorithm was trained on a random selection of 90% of the patients and 90% of the controls and then tested on the remaining 10%. This was repeated until all of the samples had been employed as test subjects, so that the cumulated classifier makes use of all of the samples, but no sample is tested using a training set of which it is a part. To reduce the impact of randomization, the K-fold separation was performed M times, producing different combinations of patients and controls in each of the K folds each time. Therefore, for an individual dataset, M*K rounds of permuted selection of training and testing sets were used for each set of genes.
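The repeated K-fold scheme described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name `repeated_stratified_kfold`, its parameters, and the round-robin dealing of samples into folds are assumptions chosen so that each fold keeps roughly the same proportion of patients and controls.

```python
import random


def repeated_stratified_kfold(labels, k=10, m=10, seed=0):
    """Yield (train_idx, test_idx) pairs for m repetitions of k-fold CV.

    Hypothetical helper: samples of each class are shuffled and dealt
    round-robin into k folds, so every fold keeps the class proportions.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for _ in range(m):
        folds = [[] for _ in range(k)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                folds[pos % k].append(idx)
        for test in folds:
            held_out = set(test)
            # A sample is never part of the training set it is tested against.
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, sorted(test)
```

With K=10 and M=10 the generator produces the M*K=100 training-testing pairs mentioned in the text.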

Independent Validation: To estimate the reproducibility of the data and the generality of the classifier, a classifier built using one dataset must be tested on another dataset. To estimate the performance, validation on the second set was performed using the classifier developed with the original dataset.

Resampling (permutation): To demonstrate dependence of the classifier on the disease state, patients and controls from the dataset were chosen at random (permuted) and the classification was repeated. The accuracy of classification using randomized samples was compared to the accuracy of the developed classifier to determine the p value for the classifier, i.e., the probability that the classifier might have been obtained by chance. In order to test the generality of a classifier developed in this manner, it was used to classify independent sets of samples that were not used in developing the classifier. The cross-validation accuracies of the permuted and original classifiers were compared on independent test sets to confirm the validity of the classifier in classifying new samples.
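A permutation p value of the kind described above can be sketched as follows. The sketch is illustrative only: `accuracy_fn` is a hypothetical callable that trains and cross-validates the classifier for a given assignment of class labels and returns its accuracy, and the add-one correction in the last line is an assumption that is not stated in the text.

```python
import random


def permutation_p_value(labels, accuracy_fn, n_permutations=1000, seed=0):
    """Estimate how often randomly permuted labels match the real accuracy.

    accuracy_fn (hypothetical) maps a list of class labels to the
    cross-validated accuracy of a classifier trained on those labels.
    """
    rng = random.Random(seed)
    observed = accuracy_fn(labels)
    hits = 0
    for _ in range(n_permutations):
        permuted = labels[:]
        rng.shuffle(permuted)  # break the link between sample and class
        if accuracy_fn(permuted) >= observed:
            hits += 1
    # Add-one correction (assumption): avoids reporting p = 0.
    return (hits + 1) / (n_permutations + 1)
```

A small p value indicates that accuracies as high as the observed one are rarely reached when the class labels carry no information.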

F. Classifier Performance

Performance of each classifier was estimated by different methods, and several performance measurements were used for comparing classifiers with each other. These measurements include accuracy, area under the ROC curve, sensitivity, specificity, true positive rate and true negative rate. Based on the required properties of the classification of interest, different performance measurements can be used to pick the optimal classifier; e.g., a classifier used for screening the whole population would require better specificity to compensate for the small (~1%) prevalence of the disease and thereby avoid a large number of false positive hits, while a diagnostic classifier for patients in hospital should be more sensitive.

G. Classifier Application

A linear classifier built by SVM for a set of genes based on a training set can be used to assign an SVM-score to any sample. Mathematically, the classifier is a set of g+1 coefficients, where g is the number of genes in the set. If E₁, . . . , E_(g) are the expression values of these genes for a sample, and C₁, . . . , C_(g+1) are the corresponding coefficients, then the SVM-score for the sample is easily calculated as C₁E₁+ . . . +C_(g)E_(g)+C_(g+1).
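As a minimal sketch of this calculation, the score is the weighted sum of the g expression values plus the final coefficient C_(g+1), which acts as a constant term. The function name `svm_score` and the convention that the constant is stored last in the coefficient list are assumptions made for illustration.

```python
def svm_score(expression, coefficients):
    """Return C1*E1 + ... + Cg*Eg + C(g+1) for one sample.

    expression   -- g expression values E1..Eg
    coefficients -- g+1 values; the last one is the constant term (assumed layout)
    """
    if len(coefficients) != len(expression) + 1:
        raise ValueError("expected one coefficient per gene plus a constant term")
    *weights, constant = coefficients
    return sum(c * e for c, e in zip(weights, expression)) + constant
```

For example, `svm_score([2.0, 3.0], [0.5, -1.0, 0.25])` evaluates 0.5·2.0 − 1.0·3.0 + 0.25 = −1.75.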

H. ROC Analysis

ROC analysis was performed to estimate each classifier's efficacy in a way that takes into consideration both sensitivity and specificity. An ROC curve is built for a classifier by varying the SVM-score cutoff and calculating the corresponding sensitivity and specificity. The area under the ROC curve (AUC) was calculated for use as the classifier performance measurement. Since a random classifier of samples would have an AUC of 0.5 and a perfect classifier would have an AUC of 1.0, the calculated AUC value can be used and reported as a percentage expression of the classifier efficacy.
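The construction of the ROC curve by sweeping the SVM-score cutoff can be sketched as follows. This is an illustrative sketch, assuming that samples scoring at or above the cutoff are called positive, that the positive class is labeled 1, and that the area is computed by the trapezoidal rule; the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per score cutoff."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing called positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y != 1)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under the curve defined by roc_points()."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Scores that separate the two classes perfectly give an area of 1.0, while scores unrelated to the class give an area near 0.5.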

I. Positive and Negative Predictive Values

Calculation of positive predictive values (PPV) and negative predictive values (NPV) takes into account not only specificity and sensitivity, but also the prevalence p of the disease:

$PPV = \frac{sens \cdot p}{sens \cdot p + (1 - spec) \cdot (1 - p)}$

$NPV = \frac{spec \cdot (1 - p)}{spec \cdot (1 - p) + (1 - sens) \cdot p}$

Thus, PPV is similar to true positive rate and shows the fraction of subjects that actually have the disease among positively classified samples, while NPV is similar to true negative rate and shows the fraction of subjects that actually do not have the disease among negatively classified samples.

PPV and NPV values were calculated for every possible SVM-score cutoff for various values of prevalence (1%, 5% and 50%). In addition to the direct use of PPV and NPV values, this allows identification of an SVM-score cutoff to use for classification in order to achieve a specified classifier predictive value.
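The two formulas above translate directly into code. The following is a minimal sketch; the function name `predictive_values` is an assumption made for illustration.

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and disease prevalence."""
    p = prevalence
    ppv = sens * p / (sens * p + (1.0 - spec) * (1.0 - p))
    npv = spec * (1.0 - p) / (spec * (1.0 - p) + (1.0 - sens) * p)
    return ppv, npv
```

For a sensitivity of 0.86 and a specificity of 0.79, the function gives a PPV of about 0.040 and an NPV of about 0.998 at 1% prevalence, and about 0.177 and 0.991 at 5% prevalence, which shows how strongly the PPV depends on prevalence.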

Example 4 SVM Supervised Classifications

(i) The SVM-RFE process was applied to a training subset of samples as follows. A t-test was performed on genes from the training set to determine the best 1000 genes that separate the two classes of samples. For each gene reduction step, SVM was run using the remaining number of genes. Coefficients for these genes from the trained classifier were then compared to eliminate genes with the least impact on the discriminant score. Ten percent of the least significant genes were removed and the process was repeated until only 1 gene was left. The performance of classifiers for each number of genes was calculated by using the corresponding classifier on the test set. Each gene received a score that corresponds to the iteration step at which the gene was eliminated. To minimize over-fitting within a dataset, K-fold cross-validation (K usually equal to 10) was used. The data were split into K parts (folds) and the algorithm was trained on K−1 folds of the case and the control groups, and then tested on the remaining 1 fold. This guaranteed that each sample was employed as a test subject. The random splitting into K folds was repeated 10 times, resulting in 100 different training-testing subset pairs. Each training-testing data split was analyzed by SVM-RFE separately. A final gene score was then calculated for all genes that were involved in training of at least one classifier. The score was equal to the average gene score across all resampling runs divided by the number of elimination iterations. Thus, a hypothetical gene that reaches the maximum elimination iteration step in all of the 100 SVM-RFE runs will receive a score of 1, while a gene that was always eliminated at the first step will receive a score of 0. Different numbers of top genes with the highest scores were used to calculate the performance of the classifier built on these genes. The classifier with the best performance indicates the optimal number of genes to use for the classification.
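The final gene score described in (i) can be sketched as follows. Two details are assumptions made for illustration and are not stated in the text: elimination steps are numbered from 0 (removed at the first step) up to the number of elimination iterations, and a gene that did not enter a given run (for example, because it failed the t-test filter for that split) contributes 0 for that run.

```python
def final_gene_scores(runs, n_iterations):
    """Average each gene's elimination step over all runs, scaled to [0, 1].

    runs         -- list of dicts, one per SVM-RFE run, mapping a gene name
                    to the step at which it was eliminated (0 = first step)
    n_iterations -- maximum elimination step a gene can reach
    """
    totals = {}
    for run in runs:
        for gene, step in run.items():
            totals[gene] = totals.get(gene, 0) + step
    n_runs = len(runs)
    return {gene: total / (n_runs * n_iterations) for gene, total in totals.items()}
```

Under these assumptions a gene that survives to the last step in every run scores 1, and a gene always removed first scores 0, matching the two limiting cases given in the text.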

(ii) The central algorithm of the SVM-RCE method was described as a flowchart (in FIG. 3 of reference 1) which consists of three main steps applied on the training part of the data:

a Cluster step for clustering the genes; an SVM scoring step for computing the Score(X(s_(i)), f, r) of each cluster of genes; and an RCE step to remove clusters with low scores. The SVM-RCE method was performed according to the following:

It was assumed that dataset D has S genes (all of the genes or the top n_g genes by t-test) and that the data was partitioned into two parts: one for training (90% of the samples) and the other (10% of the samples) for testing. X denotes a two-class training dataset consisting of samples and S genes. A score measurement was defined for any list S of genes as the ability to differentiate the two classes of samples by applying linear SVM. The score was calculated by performing a random partition of the training set X of samples into f non-overlapping subsets of equal size (f folds). Linear SVM was trained over f−1 subsets and the remaining subset was used to calculate the performance. This procedure was repeated r times to take into account different possible partitionings.

Score(X(S), f, r) was defined as the average accuracy of the linear SVM over the data X represented by the S genes, computed as f-fold cross-validation repeated r times. The default values are f=3 and r=5. If the S genes are clustered into sub-clusters of genes S₁, S₂, . . . , S_(n), the Score(X(s_(i)), f, r) was defined for each sub-cluster, where X(s_(i)) was the data X represented by the genes of S_(i). Let n=initial number of clusters, m=final number of clusters, and d=the reduction parameter.

While (n≥m) do:
1. Cluster the given genes S into n clusters S₁, S₂, . . . , S_(n) using K-means (Cluster step);
2. For each cluster i=1 . . . n calculate its Score(X(s_(i)), f, r) (SVM scoring step);
3. Remove the d % clusters with lowest score (RCE step);
4. Merge surviving genes again into one pool S;
5. Decrease n by d %.

The basic approach of the SVM-RCE was to first cluster the gene expression profiles into n clusters, using K-means. A score (Score(X(s_(i)), f, r)) was assigned to each of the clusters by linear SVM, indicating its success at separating samples in the classification task. The d % clusters (or d clusters) with the lowest scores were then removed from the analysis. Steps 1 to 5 were repeated until the number n of clusters was decreased to m. Let Z denote the testing dataset. At step 4 an SVM classifier was built from the training dataset using the surviving genes S. This classifier was then tested on Z to estimate the performance. See the above-referenced FIG. 3 of (1), the “Test” panel on the right side.
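The elimination loop described above can be sketched in outline as follows. This is a skeleton under stated assumptions, not the MATLAB implementation referred to in this example: `cluster_fn` stands in for the K-means Cluster step and `score_fn` for the Score(X(s_(i)), f, r) computation, both supplied by the caller; at least one cluster is dropped per round, and the loop stops as soon as m clusters remain, so that it always terminates.

```python
import math


def svm_rce(genes, cluster_fn, score_fn, n, m=2, d=0.1):
    """Return [(clusters_kept, surviving_genes), ...], one entry per round.

    cluster_fn(pool, n) -- hypothetical: splits the gene pool into n clusters
    score_fn(cluster)   -- hypothetical: classification score of one cluster
    """
    history = []
    pool = list(genes)
    while n > m:
        clusters = cluster_fn(pool, n)                         # Cluster step
        ranked = sorted(clusters, key=score_fn, reverse=True)  # SVM scoring step
        n_drop = max(1, math.floor(d * len(ranked)))
        keep = max(m, len(ranked) - n_drop)
        survivors = ranked[:keep]                              # RCE step
        pool = [g for cluster in survivors for g in cluster]   # merge into one pool
        n = keep                                               # decrease n
        history.append((n, pool))
    return history
```

Recording the surviving genes after every round allows the test-set accuracy to be evaluated at each level of reduction rather than only at the final m clusters.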

For the current version, the choice of n and m was determined by the investigator. In this implementation, the default value of m was 2, indicating that the method was required to capture the top 2 significant clusters (groups) of genes. However, accuracy was determined after each round of cluster elimination, and a higher number of clusters could be more accurate than the final two. The gist-svm package was used for the implementation of SVM-RFE, with a linear kernel function (dot product) and default parameters. In gist-svm the SVM employs a two-norm soft margin with C=1 as the penalty parameter. The SVM-RCE was coded in MATLAB, and the Bioinformatics Toolbox 2.1 release was used for the implementation of linear SVM with a two-norm soft margin with C=1 as the penalty parameter. The core of PDA-RFE was implemented in the C programming language using a JAVA user interface.

In order to ensure a fair comparison and to decrease the computation time, the top 300 (n_g=300) genes were selected by t-test from the training set for all methods. However, the use of t-statistics for reducing the number of onset genes subjected to SVM-RFE was not only efficient, but it also enhanced the performance of the classifier. For all of the results presented, 10% (d=0.1) was used for the gene cluster reduction for SVM-RCE and 10% of the genes with SVM-RFE and PDA-RFE. For SVM-RCE, the experiment was started using 100 (n=100) clusters and ceased when 2 (m=2) clusters remained. Three-fold (f=3) cross-validation repeated 5 (r=5) times was used in the SVM-RCE method to evaluate the score of each cluster (SVM scoring step in FIG. 3 of reference 1). More stringent evaluation parameters may be utilized by increasing the number of repeated cross-validations, at the cost of increased computational time.

(iii) For evaluating the overall performance of SVM-RCE and SVM-RFE (and PDA-RFE), 10-fold cross-validation (9 folds for training and 1 fold for testing), repeated 10 times, was employed. After each round of feature or cluster reduction, the accuracy was calculated on the hold-out test set. For each sample in the test set, a score assigned by SVM indicated its distance from the discriminant hyperplane generated from the training samples, where a positive value indicated membership in the positive class and a negative value indicated membership in the negative class. The class label for each test sample was determined by averaging all 10 of its SVM scores, and it is based on this value that the sample was classified. This method for calculating the accuracy gave a more accurate measure of the performance, since it captured not only whether a specific sample is positively (+1) or negatively (−1) classified, but how well it is classified into each category, as determined by a score assigned to each individual sample. The score served as a measure of classification confidence. The range of scores provided a confidence interval.
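The averaging of a sample's SVM scores over the repeated cross-validation runs can be sketched as follows. The function name and the data layout (a mapping from sample identifier to its list of signed scores) are assumptions for illustration, as is the convention that a mean of exactly zero is assigned to the negative class.

```python
def classify_by_mean_score(scores_per_sample):
    """Map each sample to (class label, mean SVM score).

    scores_per_sample -- dict: sample id -> signed SVM scores, one per
                         cross-validation repetition in which it was tested
    """
    result = {}
    for sample, scores in scores_per_sample.items():
        mean_score = sum(scores) / len(scores)
        label = 1 if mean_score > 0 else -1  # a zero mean goes to -1 (assumption)
        result[sample] = (label, mean_score)
    return result
```

The magnitude of the mean score is kept alongside the label because, as noted above, it serves as a measure of classification confidence.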

Clustering methods are unsupervised techniques where the labels of the samples are not assigned. K-means⁶⁷ is a widely used clustering algorithm. It is an iterative method that groups genes with correlated expression profiles into k mutually exclusive clusters. k is a parameter that needs to be determined at the onset. The starting point of the K-means algorithm is to initiate k randomly generated seed clusters. Each gene profile is associated with the cluster with the minimum distance (different metrics could be used to define distance) to its ‘centroid’. The centroid of each cluster is then recomputed as the average of all the cluster gene members' profiles. The procedure is repeated until no changes in the centroids, for the various clusters, are detected. Finally, this algorithm aims at minimizing an objective function with k clusters:

${{F\left( {{date};k} \right)} = {\sum\limits_{j = 1}^{k}{\sum\limits_{t = 1}^{t}{{g_{i}^{j} - c_{j}}}^{2}}}},$

where t is the number of genes and where ∥ ∥² is the distance measurement between the gene g_(i) profile and the cluster centroid c_(j). The “correlation” distance measurement was used as a metric for the SVM-RCE approach. The correlation distance between genes g_(r) and g_(s) is defined as:

$d_{rs} = 1 - \frac{(g_{r} - \bar{g}_{r})(g_{s} - \bar{g}_{s})'}{\sqrt{(g_{r} - \bar{g}_{r})(g_{r} - \bar{g}_{r})'}\,\sqrt{(g_{s} - \bar{g}_{s})(g_{s} - \bar{g}_{s})'}}$

where: $\bar{g}_{r} = \frac{1}{t}\sum_{j} g_{rj}$ and $\bar{g}_{s} = \frac{1}{t}\sum_{j} g_{sj}$
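The correlation distance defined above is one minus the Pearson correlation of the two profiles, and can be sketched as follows. The function name is an assumption, and raising an error for a constant profile (whose denominator is zero) is a design choice of this sketch rather than something specified in the text.

```python
import math


def correlation_distance(g_r, g_s):
    """Return 1 - Pearson correlation between two expression profiles."""
    if len(g_r) != len(g_s):
        raise ValueError("profiles must have the same length")
    t = len(g_r)
    mean_r = sum(g_r) / t
    mean_s = sum(g_s) / t
    dev_r = [x - mean_r for x in g_r]
    dev_s = [x - mean_s for x in g_s]
    numerator = sum(a * b for a, b in zip(dev_r, dev_s))
    denominator = math.sqrt(sum(a * a for a in dev_r)) * math.sqrt(sum(b * b for b in dev_s))
    if denominator == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - numerator / denominator
```

The distance ranges from 0 for perfectly correlated profiles to 2 for perfectly anti-correlated profiles, so genes that rise and fall together across samples are grouped regardless of their absolute expression level.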

K-means is sensitive to the choice of the seed clusters (initial centroids), and different methods for choosing the seed clusters can be considered. At the K-means step of SVM-RCE, i.e., the cluster step in FIG. 3 of (1), k genes are randomly selected to form the seed clusters and this process is repeated several times (u times) in order to reach the optimal clustering, i.e., the one with the lowest value of the objective function F(data; k).

The SVM-RCE method differs from related classification methods in the art since the SVM-RCE method first groups genes into correlated gene clusters by K-means and then evaluates the contributions of each of those clusters to the classification task by SVM.

Example 5 Use of Support Vector Machine (SVM) Algorithms and Recursive Cluster Elimination (RCE) to Select Significant Genes for Comparative Gene Expression in Lung Cancer

In this example, the SVM-RCE algorithm for gene selection and classification was demonstrated using two (2) datasets. As noted above, this novel algorithm combines the K-means algorithm for gene clustering and the machine learning algorithm (SVM) to identify and score (rank) those gene clusters for the purpose of classification and gene cluster ranking. Recursive cluster elimination (RCE) was then applied to iteratively remove those clusters of genes that contribute the least to the classification performance.

This algorithm was performed using the Matlab™ version of the SVM-RCE algorithm which may be downloaded from http://showelab.wistar.upenn.edu under the “Tools->SVM-RCE” tab. In summary, the SVM-RCE algorithm was evaluated in this example using head and neck tumor datasets (I) and (II) set forth below.

For Dataset (I), gene expression profiling was performed on a panel of 18 head and neck (HN) and 10 lung cancer (LC) tumor samples using Affymetrix® U133A arrays, as described in Vachani et al., Accepted Clin. Cancer Res., 2001, which is hereby incorporated by reference. For Dataset (II), gene expression profiling was performed on a panel of 52 patients with either primary lung (21 samples) or primary head and neck (31 samples) carcinomas, using the Affymetrix® HG_U95Av2 high-density oligonucleotide microarray⁶⁸.

Three algorithms, i.e., SVM-RCE, PDA-RFE and SVM-RFE, were used to iteratively reduce the number of genes from the starting value in these datasets (I) and (II) using intermediate classification accuracy as a metric. In summary, the accuracy of the SVM-RCE algorithm was determined at the final 2 gene clusters and at two intermediate levels, usually 8 and 32 clusters; these three levels correspond to about 8 genes, 32 genes and 102 genes, respectively. For the SVM-RFE and PDA-RFE algorithms, the accuracy for comparable numbers of genes was also determined. See Table VIII.

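TABLE VIII

Columns (I): Head & Neck vs. Lung Tumors (I). Columns (II): Head & Neck vs. Lung Tumors (II). #c = number of clusters; #g = number of genes; ACC = accuracy (%). A dash marks a cell that is empty in the table.

Algorithm   #c (I)   #g (I)   ACC (I)   #c (II)   #g (II)   ACC (II)
SVM-RCE     2        8        100       2         9         100
SVM-RCE     8        32       100       6         32        100
SVM-RCE     28       103      100       25        103       100
SVM-RFE     -        8        92        -         8         98
SVM-RFE     -        32       90        -         32        98
SVM-RFE     -        102      90        -         102       98
PDA-RFE     -        8        89        -         8         70
PDA-RFE     -        31       96        -         32        98
PDA-RFE     -        109      96        -         102       98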

The results comparing the independent use of the SVM-RCE and SVM-RFE algorithms on dataset (I) illustrated that the SVM-RCE algorithm had an increase in accuracy over the SVM-RFE algorithm. Specifically, an increase in accuracy of 8%, 10% and 10% with about 8, about 32, and about 103 genes, respectively, was obtained. Similarly, the results using these two algorithms on dataset (II) showed an increase of about 2% with the SVM-RCE algorithm (100% ACC) using about 8, about 32, and about 102 genes, whereas the SVM-RFE algorithm showed an ACC of about 98%. These results clearly demonstrate the superiority of the SVM-RCE algorithm over the SVM-RFE algorithm.

It was also noted that the execution time for the SVM-RCE algorithm using the MATLAB code was greater than the execution time for the SVM-RFE algorithm, which uses the C programming language. For example, when the SVM-RCE was applied on a personal computer with a P4-Duo-core 3.0 GHz processor and 2 GB of RAM on the dataset (I), the results were obtained in approximately 9 hours for 100 iterations (10 folds repeated 10 times). The corresponding results were obtained using the SVM-RFE algorithm (with the gist-svm package) in 4 minutes. To determine the reliability of these results, the SVM-RCE algorithm was again performed on dataset (I), while simultaneously tracking the performance at each iteration and over each level of gene clusters. The results obtained using the SVM-RCE algorithm, regardless of the iterations, had a standard deviation of 0.04 to 0.07. The results obtained using the SVM-RFE algorithm had a standard deviation of 0.2 to 0.23. These results show that the SVM-RCE algorithm was more robust and more stable than the SVM-RFE algorithm.

The same superiority of the SVM-RCE algorithm was observed when comparing the SVM-RCE algorithm with the PDA-RFE algorithm. See published Table 1¹ and FIG. 1¹, which use hierarchical clustering and multidimensional scaling (MDS) to help illustrate the improved classification accuracy of the SVM-RCE algorithm for dataset (I). The genes selected by the SVM-RCE algorithm clearly separated the two classes, while the genes selected by the SVM-RFE algorithm placed one or two samples on the wrong side of the separating margin.

It was also noted that the execution time for the SVM-RCE algorithm using the MATLAB code was greater than that for the PDA-RFE algorithm, which uses the C programming language.

The convergence of the algorithm to the optimal solution was also demonstrated, which gives a more visual illustration of the SVM-RCE algorithm. In summary, the mean performance over all of the clusters for each reduction level for dataset (I) was calculated. See published FIG. 1¹, in which ACC is the accuracy, TP is the sensitivity, and TN is the specificity of the remaining genes determined on the test set. AVG is the average accuracy of the individual clusters at each level of clusters determined on the test set. The x-axis provides the average number of genes hosted by the clusters.

In summary, 1000 genes were selected by t-test from the training set, distributed into 300 clusters (initial number of clusters (n)=300, final number of clusters (m)=2, the reduction parameter (d)=0.3, n_g=1000) and then recursively decreased to 2 clusters. The mean classification performance on the test set per cluster at each level of reduction (published FIG. 1¹, line AVG) dramatically improved from about 55% to about 95% as the number of clusters decreased. The average accuracy also increased as low-information clusters were eliminated. These results support the suggestion that less-significant clusters were removed while informative clusters were retained as the RCE algorithm was employed.

The stability of the SVM-RCE algorithm was also estimated, using dataset (I). The stability was estimated by setting u (u=number of times the K-means seeding process is repeated) to 1, 10, and 100 repetitions and comparing the most informative 20 genes returned from each experiment. About 80% of the genes were common to the three runs, which suggested that the SVM-RCE algorithm results were robust and stable.

In summary, these data illustrate that the SVM-RCE algorithm provides important information that cannot be obtained using algorithms in the art which assess the contributions of each gene individually. Although the initial observations were based on the top 2 clusters needed for separation of datasets with 2 known classes of samples, i.e., datasets (I) and (II), the analysis may be expanded to capture, e.g., the top 4 clusters of genes.

The results suggest that the selection of significant genes for classification using the SVM-RCE algorithm was more reliable than with the SVM-RFE or PDA-RFE algorithms. The SVM-RFE algorithm uses the weight coefficient, which appears in the SVM formula, to indicate the contribution of each gene to the classifier. The success of the SVM-RCE algorithm suggested that estimating the contribution of genes which shared a similar profile (correlated genes) was important and gave each group of genes the potential to be ranked as a group. Moreover, the genes selected by the SVM-RCE algorithm were selected for their usefulness to the overall classification, since the criterion for retaining or removing genes (clusters of genes) was based on their contribution to the performance of the classifier. The unsupervised clustering used by the SVM-RCE algorithm is also useful in identifying biologically or clinically important sub-clusters of samples.

Example 6 Assay Formats

To provide a biomarker signature that can be used in clinical practice to diagnose lung cancer, a gene expression profile with the smallest number of genes that maintain satisfactory accuracy is provided by the use of three or more of the genes identified in Table I, II, III or IV. These gene profiles or signatures permit simpler and more practical tests that are easy to use in a standard clinical laboratory. Because the number of discriminating genes is small enough, quantitative real-time PCR platforms are developed using these gene expression profiles.

A. Quantitative Real-Time PCR (RT-PCR)

A diagnostic assay as described herein may employ TAQMAN® Low Density Arrays (TLDA). The gene expression profiles described herein suggest the number of genes required is compatible with these platforms. RT-PCR has been considered to be the “gold standard” for validating array results. However, in building a PCR-based diagnostic, problems of reproducibility increase as the number of genes required for the diagnosis increases and, more critically, if the differences in expression levels are small.

Initially, a TAQMAN® Low Density Array microfluidics card designed to assay for 24 genes in duplicate using multiplexed TAQMAN® assays was used. This particular configuration assays 8 different samples that are loaded in the numbered ports at the top of the card. A profile of 24 genes was tested in duplicate with 8 samples per card. Each sample was assayed in duplicate in wells preloaded with the specific gene assays, reducing the variability associated with single-well assays. The reverse transcription reactions for each of the 8 samples were loaded in the wells at the top labeled 1-8. This platform is useful both for validation of array results and for development of a diagnostic platform to be tested on new samples. Using the TLDA cards significantly simplifies array expression validation and provides a reasonable alternative to the StaRT PCR and focused array platforms for classifier validation.

B. StaRT PCR

StaRT PCR (Gene Express) is essentially a competitive PCR with internal standards for both the gene of interest and the housekeeping gene(s). Having internal controls for housekeeping and experimental genes, it has the advantage of providing a known reference in each sample and a direct quantification of message copy numbers rather than relative copy number, as referenced to a standard curve with a reference RNA. This technique is presently the only technology that meets FDA guidelines for Multi-Gene Assay Methods for Pharmacogenomics. The high absolute accuracy of this method is replaced in the methods described herein by the use of multiple genes and internal controls. However, the diagnostic array may be tested against StaRT PCR to compare accuracy and cost.

C. Focused Diagnostic Gene Array

As the diagnostic profiles were developed, the results from ILLUMINA arrays were compared with RT-PCR data from the TLDAs. Either a custom ILLUMINA array or a custom TLDA may be designed for clinical use.

Example 7 Studies Using an Array Diagnostic and PCR Tool to Diagnose Lung Cancer in Samples from Patients with Small, Undiagnosed Lung Nodules

The diagnostic utility of the clinical assays described above was validated. The study population consisted of subjects in whom a lung nodule had been identified by either chest X-ray or chest CT scan. This group of patients represented an ideal population for biomarker use for two main reasons. First, the overall risk of lung cancer was relatively high (18-50%)¹⁷ in this group, depending on nodule size (>0.8 cm). Second, there were significant risks and costs associated with the diagnostic evaluation of these patients, which generally involved serial CT scans, PET scans, invasive biopsy procedures, and, in some cases, surgery.

Study subjects were patients with a solitary, non-calcified pulmonary nodule (>0.8 cm and <3 cm in diameter) detected by chest X-ray or chest CT scan. Only subjects without specific symptoms suggestive of malignancy (e.g. hemoptysis, significant weight loss) were included (i.e. asymptomatic patients). Non-specific symptoms (e.g. dyspnea or cough) are fairly common in current or ex-smokers and therefore subjects with these symptoms were included. Patients discovered to have a non-calcified lung nodule were usually evaluated based on the clinical likelihood of malignancy. Thus, all subjects in this cohort were ultimately identified as either a lung cancer case or a control subject based on specific pathologic and clinical criteria discussed above. The case subjects used in this aim were similar to the case subjects described in the examples above.

The control population (subjects with benign nodules) was different from the control population described in the examples above in that only high-risk patients with nodules were included. Controls were confirmed by pathologic analysis or radiographic stability for more than two years.

The data from the quantitative RT-PCR assays or focused gene arrays were evaluated as diagnostic tests. The main analysis estimated the sensitivity and specificity of the gene expression profiles described above. As the sensitivity and specificity depend on the cutoff value of the quantitative RT-PCR value (for a single biomarker) or the linear discriminant score (for an array of biomarkers), a receiver operating characteristics (ROC) analysis was executed that plots the sensitivity and specificity as a function of the cutoff value. The area under the ROC curve was estimated by conventional methods⁵⁹.

The positive predictive value (PPV) and the negative predictive value (NPV) were estimated, that is, the probability that a subject with a positive test actually has cancer (the PPV) and the probability that a subject with a negative test does not have cancer (the NPV). As these quantities depend on the prevalence of cancer in the group being tested, as well as the sensitivity and specificity of the test, these quantities were computed for a range of possible prevalences likely to hold in different clinical populations. Subgroup analysis was performed to determine the effect of race, gender, and smoking status on the accuracy of the discriminant score.

A logistic regression analysis (virtually equivalent to linear discriminant analysis (LDA)) of the target markers was performed, and certain clinical variables were evaluated using the bootstrap approach⁶⁰ to correct for over-fitting in the estimation of such indices of prediction as the area under the ROC curve and the Cronbach alpha statistic. Important clinical variables (nodule size, pack-years, years since quitting, age, and gender) were used to create a baseline predictive model. The value of the gene expression biomarkers for predicting lung cancer was evaluated by creating additional models that incorporated the linear discriminant score. This analysis established the incremental value of the gene expression biomarkers as part of the clinical evaluation of patients with asymptomatic lung nodules.

To determine whether the biomarker is useful as a trigger for change in intervention in a trial, sample size estimates were based on target values for specificity of 0.9 and sensitivity of 0.9. To ensure that confidence intervals for the sensitivity and specificity extend no more than 5% from the estimated values, at least 138 cases and 138 controls were used.
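The text does not state how the figure of 138 was derived. One calculation that reproduces it, offered here only as an assumption, is the normal-approximation (Wald) 95% confidence interval for a proportion, for which the required sample size is $n = z^{2} \cdot p(1 - p) / w^{2}$ with z=1.96, p the target sensitivity or specificity, and w the allowed half-width of the interval. The function name below is hypothetical.

```python
def sample_size_for_proportion(p, half_width, z=1.96):
    """Samples needed so a Wald interval for p extends at most half_width.

    Assumes the normal approximation to the binomial; z = 1.96 corresponds
    to a two-sided 95% confidence interval.
    """
    return z * z * p * (1.0 - p) / (half_width ** 2)
```

With p=0.9 and a half-width of 0.05 the expression evaluates to about 138.3, which is consistent with the 138 cases (for sensitivity) and 138 controls (for specificity) given above.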

Example 8 Determining Positive (PPV) and Negative (NPV) Predictive Values for the NSCLC Vs. NHC Profile

Values for the PPV and NPV calculated for the sensitivity and specificity attained testing the combined NSCLC cancers versus NHC samples are shown in Table IX below. Prevalence values suggested by the EDRN Lung Cancer Biomarker Group (LCBG) available at (http://edrn.nci.nih.gov./resources/sample-reference-sets) were adopted for screening purposes. The prevalence value is 0.01 for an at-risk population age >50 and a smoking status >30 pack-years, and 0.05 for an individual exhibiting an abnormal CT scan, with a non-calcified nodule between 0.5 and 3 cm. These PPV and NPV values were compared to the values considered to be useful for additional study by the LCBG, and to the values determined from a recent study using an 80-gene profile obtained from bronchial brushings, assuming the same prevalence. The 15 gene classifier (see Table IV, col. NSCLC/NHC) already exceeds the performance suggested by the LCBG for a good biomarker candidate, and also exceeds that of the most recently published lung cancer biomarker.

TABLE IX Positive and Negative Predictive Values for 15 gene NSCLC vs. NHC Profile

Subject                                  Sensitivity   Specificity   Prevalence   PPV     NPV
80 gene classifier (Spira et al, 51)     0.83          0.76          1%           0.034   0.998
80 gene classifier (Spira et al, 51)     0.83          0.76          5%           0.154   0.988
LCBG proposed biomarker                  0.80          0.70          1%           0.026   0.997
LCBG proposed biomarker                  0.80          0.70          5%           0.123   0.985
NSCLC vs. NHC 15 gene classifier         0.86          0.79          1%           0.040   0.998
NSCLC vs. NHC 15 gene classifier         0.86          0.79          5%           0.177   0.991

Example 9 Power Calculations

In order to estimate the number of samples required to achieve a specified accuracy from classification, the method outlined by Mukherjee⁵² was used. The estimation was done by building an empirical learning curve that expressed classification error rate e as a function of training set size n, according to: e(n)=an^(−α)+b where a, α, b are to be found by fitting the curve to the observed error rates when using a range of training set sizes drawn from a preliminary dataset. The preliminary dataset in this case consisted of 78 NSCLC of mixed cell types and 52 NHC samples, resulting in 130 samples available for power calculations. This was the most difficult classification set. Error rates were recorded taking training subsets of sizes 25, 32, 38, 45, 51, 58, 64, 70 and 77 samples (corresponding to approximately 20% to 60% of the samples), conserving the original proportion of NSCLC and NHC cases.
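A learning curve of the form e(n)=an^(−α)+b can be fitted in several ways. The sketch below is one simple possibility and is not the fitting procedure of the cited method: it scans a grid of α values and, for each, solves for a and b by ordinary least squares. The error rates are synthetic, generated from assumed parameters purely to show that the fit recovers them; only the training set sizes come from the text.

```python
def fit_learning_curve(sizes, errors, alphas=None):
    """Fit e(n) = a * n**(-alpha) + b by grid search over alpha with
    closed-form least squares for a and b (illustrative approach)."""
    if alphas is None:
        alphas = [k / 100.0 for k in range(1, 301)]  # 0.01 .. 3.00
    m = len(sizes)
    mean_e = sum(errors) / m
    best = None
    for alpha in alphas:
        x = [n ** (-alpha) for n in sizes]
        mean_x = sum(x) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxe = sum((xi - mean_x) * (ei - mean_e) for xi, ei in zip(x, errors))
        a = sxe / sxx
        b = mean_e - a * mean_x
        sse = sum((a * xi + b - ei) ** 2 for xi, ei in zip(x, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    _, a, alpha, b = best
    return a, alpha, b


# Training set sizes from the text; error rates are HYPOTHETICAL.
sizes = [25, 32, 38, 45, 51, 58, 64, 70, 77]
true_a, true_alpha, true_b = 1.0, 0.5, 0.10
errors = [true_a * n ** (-true_alpha) + true_b for n in sizes]

a, alpha, b = fit_learning_curve(sizes, errors)
# Extrapolated error at a larger training set, e.g. 117 samples.
predicted_117 = a * 117 ** (-alpha) + b
print(a, alpha, b, predicted_117)
```

With real data the observed mean error rates would replace the synthetic list, and the fitted b estimates the error floor that additional samples cannot remove.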

SVM was run 50 times for each training set size using random samples, each time classifying samples using the 500 best genes selected by t-test between cases and controls. Average error rates, along with the 25th and 75th percentiles for each training set size, were used to fit the learning curve. The error rate for this classifier built using 117 (90%) samples as training set is observed on the ROC curve with an AUC of 0.867 (not shown). The accuracy of 83% (error of 17%) lies on the calculated curve (not shown). The actual error rate of 0.17 was observed for the maximum training size available from preliminary data. Error rate approximations of 25% and 75% were detected in one set (data not shown).

Example 10 Classification of Early Stage Lung Adenocarcinoma (AC) and Lung Squamous Cell Carcinoma (LSCC) from PBMC Using cDNA Arrays

To determine whether it was possible to detect a gene expression signature in the peripheral blood that can be correlated with early NSCLC, samples from AC and LSCC patients were used since these represent about 85% of all NSCLC. Less common forms of NSCLC (e.g. large cell carcinoma) may also be detected by a classifier built on the more common NSCLC types.

Processing of all samples for RNA purification was carried out under standardized conditions.

The inventors generated a classifier by obtaining PBMC RNA from sets of “non-healthy” control patients (NHC) and patients with various types and stages of NSCLC and performing microarray analysis using a cDNA platform, i.e., nylon cDNA arrays manufactured at the Wistar Genomics Core.

The analysis was carried out using Support Vector Machines with Recursive Feature Elimination (SVM-RFE), as described in Examples 4 and 5 and in other publications⁴⁸. In some cases, the Support Vector Machines with Recursive Cluster Elimination (SVM-RCE) algorithm (International Patent Application Publication No WO 2004/105573) was used. Initial attempts to classify patients from controls from PBMC using SVM-RFE resulted in error rates for some of the comparisons, in particular all cancer vs. NHC, too high to be useful (average accuracy about 70%). To address the low signal/noise ratio, a new algorithm SVM-RCE was developed which clusters genes (by K-means clustering) into groups whose differential expression is correlated, and recursively eliminates the least informative clusters instead of individual genes. This results in the final selection of groups of genes whose differential expression changes together. On 6 published datasets¹, this method was shown to be more accurate at classification than SVM-RFE or penalized discriminant analysis (PDA-RFE), and in some cases also results in biologically meaningful clustering of samples. It is most useful for data with low signal/noise or high variance since using gene clusters as variables minimizes the effects of both these aspects of the data.

Whether SVM-RFE or SVM-RCE was applied, in order to eliminate over-fitting within a dataset, M-fold cross-validation (with M equal to 10) was used. The algorithm was trained on M−1 folds of the case and the control group, and then tested on the remaining 1 fold. This guarantees that each sample is employed as a test subject. The average score for each patient was calculated as well as the average score for each gene. The least informative gene(s) were eliminated, and the process repeated. Tables XA and XB show the classification accuracy and the sensitivity (true positive rate) and specificity (true negative rate) versus the number of genes used for classification. The analytical approaches are described in detail above.

Data for the 208 patients and controls listed in Table XA are shown in Table XB. These data were processed in 3 different “batches” of arrays called sets 3, 4, and 5. As described in Table XA, samples were grouped as early stage adenocarcinomas (AC T1T2), late stage adenocarcinomas (AC T3T4), early stage squamous cell lung cancer (LSCC T1T2) and the non-healthy controls (NHCs). Both cases and controls were usually older smokers or ex-smokers.

Second, although the classification of early stage NSCLCs was more difficult, quite good accuracy could be achieved comparing either the ACs vs. NHCs or LSCCs vs. NHCs alone and for a combined AC+LSCC classifier (Table XB, upper 3 lines). The AC+LSCC comparison to NHCs initially required 287 genes to classify combined early stage samples from NHCs with 80% accuracy (line 1). However, these results suggested it would be possible to develop a more general classifier that would detect either ACs or LSCCs. When the ACs and LSCCs were segregated and classified separately, 160 genes were initially required to distinguish early ACs from the NHCs with 85% accuracy and only 56 genes to identify the LSCC at the same accuracy. The comparison between the early ACs and LSCCs samples was then found to require only 21 genes for the discrimination, confirming the inventors' previous observations of significant differences between these 2 NSCLC cell types. Ultimately, as shown in Table IV, col. “AC/NHC”, a gene profile of 15 genes can distinguish AC from other forms of NSCLC. Further analysis is anticipated to demonstrate that as few as 6 genes are necessary for this profile, as with the pre/post surgery profile formed by the top 6 genes of Table IV, col. Pre/Post.

TABLE XA Summary of Samples Analyzed on cDNA Arrays

AC T1T2      59
AC T3T4      18
LSCC T1T2    36
LSCC T3T4    12
NHC          95

TABLE XB

Sample Classes Compared     # Genes Req'd for Classif'n/   Accuracy    Sensitivity  Specificity
                            # of clusters                  Classif'n
¹AC + LSCC T1T2 vs. NHC     287/22                         0.8         0.82         0.78
¹AC T1T2 vs. NHC            160/11                         0.85        0.83         0.85
²LSCC T1T2 vs. NHC          105                            0.87        0.72         0.93
²LSCC T1T2 vs. NHC          56/2                           0.85        0.90         0.84
²AC T1T2 vs. LSCC T1T2      21                             0.88        0.92         0.81
²AC T1T2 vs. LSCC T1T2      3                              0.85        0.86         0.83
AC T1T2 vs. AC T3T4         10                             0.92        0.98         0.72
¹AC T3T4 vs. NHC            15/2                           0.88        0.77         0.94

¹SVM-RCE was used for these analyses.
²Two accuracies were reported where a small decrease in accuracy results from a large decrease in the number of genes.

Since the differences in gene expression detected between cases and controls could be caused by a change in some fraction of the PBMC population, a small flow cytometry study comparing PBMC fractions from 14 NHC lymphocytes to lymphocytes from 14 patients with AC, 15 patients with LSCC, and 6 other NSCLC was performed. In agreement with recent findings⁴⁹ for patients with malignant melanoma, there was no statistically significant difference in proportions of CD4 or CD8 T-Cells, B-cells, NK-Cells or monocytes between cases and controls.

Example 11 Classification of Early Stage (T1/T2) NSCLCs from NHCs on Illumina Q-PCR Arrays

cDNA array results required 287 genes to distinguish the combined classes of NSCLC samples from the NHCs (see Table XB, line 1, above). The Illumina data however permitted development of a more accurate and global classifier for AC/NHC classification with many fewer genes. The Illumina data available for this analysis included 78 NSCLCs (including 51 ACs, 15 LSCCs, 12 unclassified NSCLCs) and 52 NHC samples. The SVM-RFE analysis indicated 15 genes could classify this dataset with an accuracy of 83%. See Table IV above. The SVM scores for the individual patients and controls shown in FIG. 3 were produced from the performance of the 15 gene classifier of Table IV, col. NSCLC/NHC. These results show that a more general classifier can be used to classify the two main NSCLC cell types.

In one experiment, PBMC from 44 patients with small AC (T1 or T2 size tumors) vs. PBMC from 95 age-, gender- and smoking-matched controls were used. Discriminant scores were generated using nylon arrays and SVM-RCE as described above. The results are provided in FIG. 2. A positive score indicates lung cancer and a negative score indicates no cancer. Each column represents a single patient or control sample. The height of the column is a measure of how well an individual sample is classified. The control samples are on the right and are given a negative score. The patients are on the left. Lighter bars with a positive score are misclassified controls and darker bars with a negative score are misclassified cases. Samples at the margin with scores close to zero should be unclassified. Only the AC T1T2 samples are shown. The samples in the middle where the columns switch from positive to negative or vice versa are misclassified. Using this classifier employing 15 genes of Table IV, col. AC/NHC, the presence of early stage lung cancer was identified with 85% accuracy.

In still another experiment, forty-four (44) early stage T1T2 AC patient samples were compared to 52 NHC. Genes were filtered by t-test and then SVM-RFE was applied (see Example 4 or 5) and the 15 genes selected by SVM-RFE were used (Table IV, col. AC/NHC). Classification accuracies were analyzed with progressive gene elimination (from 2781 genes to 1) by SVM-RFE⁴⁸ (data not shown), measuring True Positives, i.e., the number of patients the classifier correctly assigned a positive SVM-score, and True Negatives, i.e., the number of controls the classifier correctly assigned a negative SVM-score. Accuracy was plotted as (TP+TN)/n (n=total number of samples). The favorable s/n and lower variance using the Illumina arrays made the use of the SVM-RCE algorithm unnecessary. SVM-RFE was used for all the Illumina studies as SVM-RCE requires much longer run times than SVM-RFE. The optimal classifier is selected based on the best accuracy with the smallest number of genes. Expression levels of just 15 genes (e.g., the top 15 genes of Table IV, column labeled ALL/NHC) were found to discriminate the early stage T1T2 ACs from the NHCs with an overall accuracy of 85%. This same accuracy was found with cDNA arrays, but 160 genes were initially needed for this degree of separation. These results confirm that the generation of the gene expression profile is not platform specific. The inventors' original discovery of the gene expression profile was affirmed on a second and quite different platform.

Example 12 Changes in Tumor Associated Signatures in PBMC after Removal of the Tumor

To identify a signature that reflects the tumor presence and is useful for assessing the probability of recurrence, PBMC profiles from the subset of patients with early lung cancers who had blood samples taken before and soon (2-6 months) after “curative” surgery were compared. This minimized background “noise”, so that a gene expression signature correlated with the presence of the tumor can be more readily identified. Reversion of the PBMC profile to a “lung cancer” profile thus predicts recurrence.

A. Effect of Presence of the Tumor

In order to determine whether the difference in gene expression profiles seen between cases and controls was dependent on the presence of the tumor, the inventors examined how PBMC samples taken from the same NSCLC patient pre-surgery and then again ˜2-6 months post surgery were classified with the 15 gene classifier that was selected in a comparison of 78 NSCLC patients and 52 NHCs (see FIGS. 3 and 4). The genes selected in this comparison were used because the pre-post samples were derived from patients with either AC, LSCC or indeterminate NSCLC. The pre-surgery NSCLC samples were included in the analysis shown in FIGS. 3 and 4, but the post-surgery samples were not included. The post surgery samples comprise an independent test set. The rationale was to determine whether the patient samples collected post surgery retained the tumor signature, which in this case is indicated by a positive predictive score, or whether the removal of the tumor would diminish the tumor signature and they would now score more like the controls. The odds of this occurring by chance are <0.01.

13 out of 16 of the patient pairs exhibited a decrease in the tumor predictive score after surgery. Six of the cases have positive pre-surgery scores and a post surgery score that is negative, placing them clearly in the control class, while 4 additional samples had significant drops in the post-surgery samples bringing them close to zero. Two of the cases had no change in the tumor score after surgery and 1 case had an increase in the tumor score. Two of the cases have a negative pre-surgery score but even in this case it becomes more negative in the post-surgery sample. Additional patient follow up determines the extent to which the post-surgery scores are prognostic for recurrence. The observation that the tumor signature decreased after the removal of the malignancy supported the gene expression profile or signature as a response to the presence of the tumor. See FIGS. 5 and 6.

B. Comparison of Pre- and Post Surgery Samples.

The pre-surgery samples were compared to the post-surgery samples to determine whether the 2 classes of samples could be separated based on the intrinsic differences that were demonstrated in the pairwise analysis in FIG. 3. The 16 pre-surgery samples were compared to the 16 post-surgery samples. SVM-RFE was carried out starting with the top 1,000 genes identified by t-test using 10-fold cross-validation repeated 10 times. Just six genes were determined to distinguish the pre from the post samples with an accuracy of 93%. This 6 gene classifier (the top genes identified in Table IV (col. Pre/Post)) was then used to generate the discriminant scores for the pre- and post surgery samples as shown in FIG. 3. The pre-surgery samples (dark shading) are all classified correctly although one sample has a score close to zero. One of the post-surgery samples has a negative score close to zero and 2 are misclassified. This result suggests that a classifier could be developed that might be effective in screening post-surgery patients for recurrence because it would provide the possibility to compare post-surgery scores with the initial pre-surgery score of the same patient over time. Follow-up samples provide a sensitive indicator of recurrence.

In another study, using Illumina array data, genes were selected by comparison of pre-surgery lung cancer samples with NHC smoker controls. Fifty-four (54) genes were used to classify the post-surgery samples. A discriminant score was given to each sample (positive is indicative of lung cancer; negative is indicative of no cancer). In the early analysis (not shown) in all but one comparison, the post score is lower than the pre-surgery sample score, which is adjacent. In three cases, the score of the post surgery sample is negative, classifying those samples with the COPD controls. These data support the detection of a tumor-related gene expression signature that diminishes after surgery. The extent of those changes reflects the possibility of recurrence.

Given the positive results of the pilot study on 16 paired samples presented here, the utility of this test lies in its application in conjunction with the presence of a lung nodule detected by other procedures such as CT scans. Furthermore, NSCLCs of different cell types (ACs and LSCCs) can be differentiated by a signature designed to make that distinction.

Example 13 Comparison of the Top 15 Genes as Ranked by SVM-RFE for the 3 SVM-RFE Classifiers

The 15 top genes by SVM-RFE rank from the 3 Illumina studies are listed in Table IV above. The ranks for each of the genes as assigned in the individual studies by SVM are maintained in Table IV. For the AC/NHC comparison and the comparison of all NSCLC cell types to NHC (ALL/NHC) the 15 genes listed are the genes used to assign the SVM scores shown in FIGS. 2, 4, and 6. The 15 genes for the ALL/NHC comparison were p<3×10⁻⁵. The 15 AC/NHC genes were p<2×10⁻⁴ and the Pre/Post genes were p<6×10⁻³. The first 6 genes in the PRE/POST column were used to generate the scores for FIG. 4. The genes shown in bold type are common to either 2 or 3 comparisons. The genes that are not common to the 3 classifiers are not necessarily unique to that comparison but may simply appear at a lower rank position in the extended gene lists. Eight of the top ranked 15 genes for the AC/NHC and the ALL/NHC appear in both lists. Of the top 6 genes used for the PRE/POST classification, 3 are listed in either one or both of the other lists. Two probes for HSPA8 are listed. The (A) indicates all HSPA8 isotypes are detected by this probe; (I) indicates a specific isotype (in this case transcript variant 1) is detected by the second HSPA8 probe.

Data on the cDNA array platform reported classification accuracies for comparisons of NSCLCs of different cell types and T stages to NHCs and to each other. The inventors' preliminary data on the Illumina platform were restricted to those patients with early stage AC vs. NHCs or combined NSCLCs vs. NHCs. This was by choice, since ACs are the most common type of NSCLCs and it was important to minimize histological heterogeneity in the initial samples to be analyzed on the new platform. A more general classifier includes a more diversified sample set of cases including LSCCs and indeterminate NSCLCs. Additional samples assayed on the Illumina arrays demonstrate whether the particular subtypes of lung cancer (i.e. AC vs. LSCC) have their own distinct expression patterns as the cDNA arrays suggest and/or whether there is a PBMC signature that can accurately identify all early NSCLCs.

In one embodiment, the ALL/NHC column of Table IV shows the 15 gene profile to identify an NSCLC from controls. In another embodiment, the AC/NHC column of Table IV shows the 15 gene profile to identify an AC. In still another embodiment, the PRE/POST column shows the 15 gene profile to identify the efficacy of surgical resection of the tumor and prognosis going forward. As described above, this gene profile has successfully been reduced to only the top 6 genes of that column. It is anticipated that smaller gene selections will be identified for the other two indicated profiles as well. In another embodiment, cell type specific signatures using genes that are present in all three signatures are anticipated to augment the predictive power of these reported scores.

Example 14 29 Gene Expression Signature

To identify a gene expression signature in PBMCs which would accurately distinguish patients with lung cancer from non-cancer controls with similar risk factors (i.e. matched for age, gender, race, smoking history), microarray gene expression profiles in peripheral blood mononuclear cells (PBMC) from patients with NSCLC were compared to a control group with smoking-related non-malignant lung disease. A distinguishing gene signature was found and validated on 2 independent sets of samples not used for gene selection. Gene expression changes were also compared between pre- and post-surgery samples from 18 patients.

A novel 29-gene diagnostic signature (genes ranked 1-29 of Table V) was found which distinguishes individuals with NSCLC from controls with non-malignant lung disease with 91% Sensitivity, 79% Specificity and a ROC AUC of 92%. Accuracies on independent sets of 18 NSCLC samples from the same location and 27 samples from an independent location were 74% and 79%, respectively. The 29 gene signature was significantly reduced after tumor removal in 83% of a subset of 18 patients in whom gene expression was measured before and after surgical resection.

Although smoking and COPD each affect PBMC gene expression, the additional response to a tumor presence can be identified, allowing the diagnosis of patients with lung cancer from controls with high accuracy. The PBMC signature is particularly useful in the diagnostic algorithm for those patients with a non-calcified lung nodule. The observation that the 29-gene signature diminishes after surgical resection supports that it is tumor related.

Study Populations: Study participants (Table XVI) for the initial training and validation sets were recruited from the University of Pennsylvania Medical Center (Penn) during the period 2003 through 2007: 91 subjects with a history of tobacco use without lung cancer including 41 subjects that had one non-calcified lung nodule diagnosed as benign after biopsy and 155 patients with newly diagnosed, histopathologically confirmed non-small cell lung cancer. Subjects with any prior history of cancer or cancer treatment except non-melanoma skin cancer were excluded. The study was approved by the Penn Institutional Review Boards. An additional 27 patients and controls were collected at New York University (NYU) Medical Center under IRB approval and are also listed in Table XVI.

TABLE XVI Summary of demographics

Category                                     Number of patients

All NSCLC vs. NHC experiment samples
  Total                                      228
  Controls                                   91
  Patients                                   137
  has COPD                                   128
  no COPD                                    82
  unknown COPD                               18
  Smokers                                    34
  Quit smoking                               170
  Never smokers                              24

Patients from NSCLC vs. NHC experiment
  Total                                      137
  AC                                         85
  LSCC                                       42
  NSCLC                                      10
  has COPD                                   63
  no COPD                                    65
  unknown COPD                               9
  Stage 1A                                   48
  Stage 1A + 1B                              75
  Stage 4                                    5
  Stage 1/2                                  93
  Stage 3/4                                  44
  Stage 2/3/4                                62
  AC 1A                                      30
  AC 1                                       48
  AC 2/3/4                                   37
  LSCC 1A                                    16
  LSCC 1                                     24
  LSCC 2/3/4                                 18
  Smokers                                    26
  Quit smoking                               102
  Never smokers                              9

Controls from NSCLC vs. NHC experiment
  Total                                      91
  pure COPD (nothing else)                   38
  GI/NM                                      41
  has COPD                                   65
  no COPD                                    17
  unknown COPD                               9
  Smokers                                    8
  Quit smoking                               68
  Never smokers                              15

Pre-post pairs
  Total                                      18
  AC                                         10
  LSCC                                       6
  NSCLC                                      2

NYU samples
  Total                                      27
  AC                                         12
  NHC                                        15

PBMC Collection and Processing: Lung cancer patients and patients with non-malignant lung disease had blood collection prior to surgery and/or prior to treatment with chemotherapy. Control patients had blood drawn in conjunction with a clinical visit. Blood samples were drawn in two “CPT” tubes (BD). PBMC were isolated within 90 minutes of blood draw, washed in PBS, transferred into RNA Later (Ambion) and then stored at 4° C. overnight before transfer to −80° C. PBMC from a subset of patients were stained with anti-CD3, CD4, CD8, CD14, CD16, CD19, or CD-56 antibodies or isotype controls (BD Biosciences) for flow cytometry and analyzed using Flo-Jo software. Samples from NYU were processed within 2 hours from collection; PBMC were transferred to Trizol (Invitrogen) and stored at −80° C. Extracted RNA was transferred to Wistar for further processing.

Sample Processing: RNA purification of the first set of samples “Penn” was carried out using TriReagent (Molecular Research) as recommended and controlled for quality using the Bioanalyzer. Only samples with 28S/18S ratios >0.75 were used for further studies. A constant amount (400 ng) of total RNA was amplified as recommended by Illumina. The second set of samples “NYU” was DNAse treated before hybridization. Samples were processed as mixed batches of patients and controls and hybridized to the Illumina WG-6v2 human whole genome bead arrays (http://www.illumina.com/pages.ilmn?ID=197).

Array quality control and pre-processing: All arrays were checked for outliers by computing gene-wise between-array median correlation and comparing it with the correlation for each array. Non-informative probes were removed if their intensity was low relative to background in the majority of samples or if the maximum ratio between any 2 samples was not at least 1.2. Arrays were then quantile normalized and background was subtracted from expression values.

Analysis: Classification was performed using a Support Vector Machine with recursive feature elimination (SVM-RFE)¹⁹ using 10-fold cross-validation repeated 10 times. Classification scores for each tested sample were recorded at each reduction step, down to a single gene. Average accuracy for each reduction step was calculated and all the genes at the points of maximal accuracy formed the initial discriminator which then underwent additional reduction to form the final discriminator as described below.

Quantitative Real-Time PCR: RT-PCR validation of array results was carried out using the ABI TaqMan System as recommended, in an ABI 7900HT PCR System. Each sample was analyzed in duplicate and samples with CVs between replicates that were more than 0.5 delta Ct were repeated.

The results are reported below:

Clinical and demographic variables of the study samples (case and control) are summarized in Table XVI above for 155 case patients and 91 clinic controls including those with clinically diagnosed benign nodules. The groups were similar in terms of age, race, gender, and smoking history. 84% of the clinical control group and 93% of the NSCLC group were current or previous smokers. These samples were all collected at the University of Pennsylvania Medical Center. An additional 12 patients and 15 controls were used for external validation. Flow cytometry was performed on 35 cancer cases and 14 controls. There were no significant differences in the percentages of T-cells, CD4 cells, B-cells, monocytes, or NK cells (data not shown). The tumor group had a slightly lower percentage of CD8 cells (18.9%) than the controls (24.5%), which did reach significance.

Gene expression profiles in PBMC samples from 137 patients with NSCLC were compared to 91 controls with non-malignant lung disease (non-healthy controls, NHC) to determine whether consistent differences in gene expression could be detected across the large data set. Gene expression in PBMC was found to identify individuals with a lung cancer, e.g., NSCLC. Over 4500 of 48,000 probes (9%) were significantly changed (two-tail t-test, p<0.05, false discovery rate 8%) between cases and controls. For comparison, data reported on lung tumors identified 1649 of 12,600 transcripts (13%) which distinguish adenocarcinomas from normal lung tissue and 1886 (15%) which distinguish squamous cell carcinoma from normal lung at the same significance. The fraction of genes changed in the PBMC of the average NSCLC patient is similar to the reported fraction of genes changed between the tumor and its normal tissue counterpart²⁰.

A support vector machine with recursive feature elimination (SVM-RFE) and 10-fold cross-validation were next used to find the minimal number of genes which could distinguish the cancer and control groups from their PBMC gene expression. The selection process of the 29 genes by SVM-RFE is described in detail as follows.

Data Pre-Processing/Expression Levels and Normalization:

Samples were processed as mixed batches (total of 12 batches) of patients and controls and hybridized to the Illumina WG-6v2 human whole genome bead arrays. Raw data were processed by the Bead Studio v. 3.0 software. Expression levels were exported for signal and negative control probes. The set of negative control probes was used to calculate the average background level for further filtering and background subtraction steps. Average values of the signal probe expression data for the 137 patient (NSCLC) and 91 control (NHC) sample arrays (outliers removed, see below) were used as a base for normalization and all the arrays, including 18 PRE/18 POST samples and NYU samples, were quantile normalized against this base.
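Quantile normalization against a fixed base can be sketched as follows. This is a minimal illustration, assuming that each array value is replaced by the base value of the same rank and that ties are broken by position; the function name and the toy numbers are hypothetical, and the software actually used may treat ties differently.

```python
def quantile_normalize_to_base(array, base):
    """Map each value in `array` to the base value holding the same rank.

    `base` stands in for the per-probe average of the reference arrays;
    only its sorted distribution is used here.
    """
    if len(array) != len(base):
        raise ValueError("array and base must cover the same probes")
    reference = sorted(base)
    # Probe indices ordered from lowest to highest expression.
    order = sorted(range(len(array)), key=lambda i: array[i])
    normalized = [0.0] * len(array)
    for rank, probe_index in enumerate(order):
        normalized[probe_index] = reference[rank]
    return normalized


# Toy data: four probes on one array and a hypothetical base distribution.
base = [100.0, 400.0, 50.0, 200.0]
sample = [9.0, 3.0, 7.0, 1.0]
normalized_sample = quantile_normalize_to_base(sample, base)
print(normalized_sample)  # [400.0, 100.0, 200.0, 50.0]
```

After this step every array shares the value distribution of the base, so remaining differences between arrays reflect probe rank rather than overall intensity.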

Array Quality Control.

After each hybridization batch, gene-wise global correlation was computed as a median Spearman correlation across all pairs of microarrays from all batches using expression levels of all signal probes (>48K). Median absolute deviation of the global correlation was also calculated. Then for each microarray a median Spearman correlation against all other arrays was computed. The arrays whose median correlation differed from the global correlation by more than 8 absolute deviations (threshold was picked empirically) were marked as outliers and were not used for further analysis. 22 outliers were found at various stages, but 11 of these provided valid data on repeated arrays and these were included in the analysis.
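The outlier rule can be sketched in plain Python as below. Several details are assumptions made for illustration: the median absolute deviation is taken over the pairwise correlations around their median, ties in expression values share an average rank, and all function and array names are hypothetical. In the toy data the four consistent arrays are perfectly rank-correlated, so the deviation is zero and any departure is flagged; real data would give a non-zero deviation.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def ranks(values):
    """Average ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den


def find_outlier_arrays(arrays, n_deviations=8.0):
    """Flag arrays whose median correlation to the others is far from
    the global median correlation (assumed reading of the rule)."""
    names = list(arrays)
    pair_corr = {
        frozenset(pair): spearman(arrays[pair[0]], arrays[pair[1]])
        for pair in combinations(names, 2)
    }
    global_corr = median(pair_corr.values())
    mad = median(abs(c - global_corr) for c in pair_corr.values())
    outliers = []
    for name in names:
        own = median(
            pair_corr[frozenset((name, other))] for other in names if other != name
        )
        if abs(own - global_corr) > n_deviations * mad:
            outliers.append(name)
    return outliers


# Toy example: array E has reversed probe order relative to the others.
toy_arrays = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [1, 3, 5, 7, 9],
    "D": [10, 20, 30, 40, 50],
    "E": [5, 4, 3, 2, 1],
}
flagged = find_outlier_arrays(toy_arrays)
print(flagged)
```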

Background Subtraction.

After quantile normalization the average background value (60, as determined for these data) was subtracted from each probe's expression data, which was then floored to one standard deviation of the background (15 for our data), the minimum expression value used in any calculation.
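The subtraction and flooring step amounts to a one-line transformation. The sketch below uses the background mean (60) and standard deviation (15) quoted above; the function name and the input values are illustrative.

```python
BACKGROUND_MEAN = 60.0  # average background reported for these data
BACKGROUND_SD = 15.0    # one standard deviation of the background


def subtract_background(values, background=BACKGROUND_MEAN, floor=BACKGROUND_SD):
    """Subtract the background and clamp results to the floor value."""
    return [max(v - background, floor) for v in values]


corrected = subtract_background([50.0, 70.0, 90.0, 500.0])
print(corrected)  # [15.0, 15.0, 30.0, 440.0]
```

Flooring keeps every expression value positive, which matters later when fold changes are computed as ratios.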

Probe Filtering.

Based on 137 patient and 91 control sample arrays, non-informative probes were defined to be probes that are not expressed at least 1.5 times background (corresponds to expression value of 30 for background subtracted data) in more than 25% (57) of samples or probes that do not change at least 1.2 fold between at least two samples. The data from all arrays were filtered by removing these non-informative probes, resulting in expression data of 15227 probes for analysis. These procedures result in quantile normalized, outlier removed, background subtracted, non-informative probe filtered data, which were analyzed as follows:
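The two filtering criteria can be expressed as a small predicate. This sketch assumes that "more than 25%" is a strict inequality and that the 1.2 fold criterion compares the largest and smallest background-subtracted (and floored, hence positive) values of a probe; the function name and toy values are hypothetical.

```python
def is_informative(values, min_expression=30.0, min_fraction=0.25, min_fold=1.2):
    """Return True if a probe passes both filters.

    values: background-subtracted, floored expression of one probe
            across all samples (all values must be positive).
    """
    n_expressed = sum(1 for v in values if v >= min_expression)
    if n_expressed <= min_fraction * len(values):
        return False  # not expressed in enough samples
    # Largest fold change between any two samples.
    return max(values) / min(values) >= min_fold


expressed_and_variable = is_informative([15, 15, 15, 40, 45, 50, 60, 80])
rarely_expressed = is_informative([15, 15, 15, 15, 15, 15, 15, 40])
flat_probe = is_informative([100, 101, 102, 103])
print(expressed_and_variable, rarely_expressed, flat_probe)
```

Only probes for which the predicate is true would be carried into classification.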

The primary approach involved a classifier for a dataset trained using the SVM algorithm. A Recursive Feature Elimination (RFE) strategy was used to reduce the number of genes required for the classification. Ten-fold cross-validation was employed to avoid data overfitting and provide unbiased estimation of the classifier accuracy. The trained classifier applied to a sample provided a discriminant score that was used to predict one of two classes (malignant or non-malignant disease, pre or post, etc.) for the sample.

Cross-Validation:

Ten-fold cross-validation with 10 resamples was used in the classifications of NSCLC vs. NHC (including hold-out and permutation validations) and PRE vs. POST datasets. At each of 10 resample steps, data were randomly split into 10 parts (folds) while retaining the original ratio of the two classes. Each fold was used as a testing subset once while the other 9 parts were used as training subsets. This resulted in 10 unique training-testing sets for each resample, and combined with 10 resample steps, 100 unique combinations of 90% samples used for training and 10% samples used for testing. This also ensured that each sample was involved in testing exactly 10 times. The testing was done using classifiers that were not trained on the sample in any way. A discriminant score for each sample was calculated as an average of 10 scores predicted by classifiers that were not trained on a subset including the sample.
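The resampling scheme can be sketched as a generator of stratified splits. This is an illustrative implementation only: the round-robin assignment of shuffled samples to folds, the fixed seed and the function names are assumptions, not details taken from the text.

```python
import random


def stratified_folds(labels, n_folds=10, rng=None):
    """Split sample indices into folds that keep the class ratio."""
    rng = rng or random.Random()
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal samples of this class round-robin, continuing where the
        # previous class stopped so fold sizes stay balanced.
        for i, index in enumerate(indices):
            folds[(i + offset) % n_folds].append(index)
        offset += len(indices)
    return folds


def repeated_cv_splits(labels, n_folds=10, n_resamples=10, seed=0):
    """Yield (train_indices, test_indices) for every fold of every resample."""
    rng = random.Random(seed)
    everything = set(range(len(labels)))
    for _ in range(n_resamples):
        for test in stratified_folds(labels, n_folds, rng):
            yield sorted(everything - set(test)), sorted(test)


labels = ["NSCLC"] * 137 + ["NHC"] * 91
splits = list(repeated_cv_splits(labels))
print(len(splits))  # 100 training-testing combinations
```

Each sample appears in exactly one test fold per resample, so across the 10 resamples it is tested 10 times by classifiers that never saw it during training.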

RFE:

Each of 100 unique training-testing splits provided by cross-validation was used by SVM-RFE independently. From the training subset, the 1000 top genes (features) ranked by p-value of t-test between the two classes were retrieved. The classifier was trained using a linear kernel to distinguish between the classes using expression levels of those genes. The classifier was then applied to each sample from the testing subset and discriminant scores were recorded. SVM-RFE then eliminated 10% of the remaining genes that had the smallest absolute coefficients in the classifier's scoring function, i.e. those least important genes that affect the final score the least. The process was repeated (50 times) until one gene was left for training.
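The elimination schedule can be illustrated without training any classifier. The sketch below assumes that "10% of the remaining genes" is rounded up to a whole gene at every step, which is not stated in the text; with that assumption, starting from 1000 genes gives 50 gene-set sizes ending at a single gene, which is consistent with the 50 repetitions mentioned above. The helper that drops genes takes a hypothetical mapping of gene name to linear coefficient.

```python
def rfe_schedule(n_start=1000):
    """Gene-set sizes visited when 10% (rounded up) is removed per step."""
    sizes = [n_start]
    while sizes[-1] > 1:
        n = sizes[-1]
        n_drop = (n + 9) // 10  # ceil(n / 10) in integer arithmetic
        sizes.append(n - n_drop)
    return sizes


def eliminate_least_important(weights):
    """Drop the 10% of genes with the smallest absolute coefficient.

    weights: dict mapping gene name -> coefficient in a linear scoring
             function (names here are placeholders).
    """
    n_drop = (len(weights) + 9) // 10
    ranked = sorted(weights, key=lambda gene: abs(weights[gene]))
    dropped = set(ranked[:n_drop])
    return {gene: w for gene, w in weights.items() if gene not in dropped}


schedule = rfe_schedule(1000)
print(schedule[:5], "...", schedule[-3:], "steps:", len(schedule))

toy_weights = {f"gene_{i}": w for i, w in enumerate(
    [0.9, -0.05, 0.4, -0.7, 0.3, 0.2, -0.6, 0.8, 0.1, -0.5])}
kept = eliminate_least_important(toy_weights)
```

In a full SVM-RFE loop the classifier would be retrained on the kept genes after every elimination, and the test-fold scores recorded at each size in the schedule.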

Performance:

100 cross-validation steps of the SVM-RFE process produced for each sample 10 prediction scores at each feature elimination iteration. A final sample score was computed as an average of these prediction scores for each set of genes tested, from 1000 to 1. Accuracy, sensitivity and specificity of the classification were calculated based on final scores of samples, using 0 as the classification threshold, i.e. samples with scores ≥0 were classified as the positive class, while samples with scores <0 were classified as the negative class. Classifiers trained at the feature elimination iteration that provided the best accuracy were selected, and a global classifier for all the samples consisted of the genes from each of the 100 optimal classifiers. For example, 100 cross-validation steps, each with maximum accuracy at about 8 genes, yielded a global classifier of 136 genes for the NSCLC vs. NHC experiment (Table V above). A ROC curve was built by varying the classification threshold from the maximum to the minimum of the sample scores.
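The performance measures can be computed directly from the final sample scores. In the sketch below the scores are made up for illustration, and the area under the ROC curve is obtained through its rank interpretation (the probability that a randomly chosen case scores higher than a randomly chosen control), which gives the same area as sweeping the threshold over the scores.

```python
def classification_metrics(scores, is_case, threshold=0.0):
    """Accuracy, sensitivity and specificity with scores >= threshold
    called positive."""
    tp = sum(1 for s, c in zip(scores, is_case) if c and s >= threshold)
    tn = sum(1 for s, c in zip(scores, is_case) if not c and s < threshold)
    n_cases = sum(1 for c in is_case if c)
    n_controls = len(is_case) - n_cases
    return {
        "accuracy": (tp + tn) / len(scores),
        "sensitivity": tp / n_cases,
        "specificity": tn / n_controls,
    }


def roc_auc(scores, is_case):
    """Area under the ROC curve via pairwise case/control comparisons;
    ties count as half a win."""
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in cases
        for q in controls
    )
    return wins / (len(cases) * len(controls))


# HYPOTHETICAL final scores: three cases followed by three controls.
scores = [1.2, 0.4, -0.3, 0.1, -0.8, -1.5]
is_case = [True, True, True, False, False, False]
metrics = classification_metrics(scores, is_case)
auc = roc_auc(scores, is_case)
print(metrics, auc)
```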

Classifier Minimization:

To reduce the number of genes used by the classifiers in all cross-validation steps, without retraining and on the condition that accuracy was not reduced, the unique genes that were involved in classification for a given RFE iteration across all cross-validation steps were ranked by their averaged absolute coefficients in the classifiers' scoring functions. The least important genes were removed one at a time from all scoring functions. The accuracy was recorded after each removal, and the minimum number of genes N that provided the same final classification accuracy M was used. The notation “N-gene classifier that has M % accuracy” based on these results was used.
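One possible reading of this minimization step is sketched below. Several details are assumptions made for the example: each sub-classifier is represented as a (weights, bias) pair, a gene absent from a sub-classifier counts as a zero coefficient when averaging, and every sample is scored by all sub-classifiers (the procedure above uses only those not trained on the sample). All names and numbers are hypothetical.

```python
def linear_score(weights, bias, sample, active):
    """Score one sample, ignoring genes that have been removed."""
    return bias + sum(w * sample[g] for g, w in weights.items() if g in active)


def ensemble_accuracy(classifiers, samples, labels, active):
    """Fraction of samples whose averaged score falls on the correct side of 0."""
    correct = 0
    for sample, label in zip(samples, labels):
        avg = sum(linear_score(w, b, sample, active) for w, b in classifiers)
        avg /= len(classifiers)
        correct += (avg >= 0) == label
    return correct / len(samples)


def minimize_genes(classifiers, samples, labels):
    """Remove least important genes one at a time, without retraining."""
    genes = {g for w, _ in classifiers for g in w}
    importance = {
        g: sum(abs(w.get(g, 0.0)) for w, _ in classifiers) / len(classifiers)
        for g in genes
    }
    order = sorted(genes, key=lambda g: (importance[g], g))  # least important first
    active = set(genes)
    target = ensemble_accuracy(classifiers, samples, labels, active)
    best = set(active)
    for gene in order[:-1]:  # always keep at least one gene
        active.discard(gene)
        if ensemble_accuracy(classifiers, samples, labels, active) >= target:
            best = set(active)  # smallest gene set so far with undiminished accuracy
    return best, target


# Hypothetical two-member ensemble in which gene "A" carries the signal.
classifiers = [({"A": 1.0, "B": 0.01}, 0.0), ({"A": 0.8, "C": 0.02}, 0.0)]
samples = [
    {"A": 1.0, "B": 0.5, "C": 0.5},
    {"A": 2.0, "B": -0.5, "C": 0.1},
    {"A": -1.0, "B": 0.5, "C": 0.5},
    {"A": -2.0, "B": 0.3, "C": -0.4},
]
labels = [True, True, False, False]
print(minimize_genes(classifiers, samples, labels))  # prints: ({'A'}, 1.0)
```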

Classifier Application:

For new samples not used in cross-validation, a classifier selected at the accuracy maximum and then gene-minimized was applied. This classifier was built from the 100 sub-classifiers obtained at each step of the cross-validation for the selected RFE iteration. The final sample score was the average of the 100 scores provided by those classifiers. Note that when the classifier was applied to a sample that was used in the cross-validation, only the 10 of the 100 sub-classifiers that were not trained on the sample were used.
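The averaging rule described above can be sketched in a few lines. The representation of a sub-classifier as a (scoring function, set of training sample IDs) pair is an assumption for the example, as are the function name and the toy values.

```python
def ensemble_score(sub_classifiers, sample, sample_id=None):
    """Average the scores of the sub-classifiers that never saw this sample.

    `sub_classifiers` is a list of (score_fn, trained_on) pairs, where
    `trained_on` is the set of sample IDs used to train that sub-classifier.
    For a new sample (sample_id=None) every sub-classifier contributes.
    """
    eligible = [
        score_fn
        for score_fn, trained_on in sub_classifiers
        if sample_id is None or sample_id not in trained_on
    ]
    return sum(score_fn(sample) for score_fn in eligible) / len(eligible)


# Hypothetical three-member ensemble.
subs = [
    (lambda s: s["x"] + 1.0, {"a"}),
    (lambda s: s["x"] - 1.0, {"b"}),
    (lambda s: 2.0 * s["x"], {"a", "b"}),
]
print(ensemble_score(subs, {"x": 1.0}))       # new sample: all three members vote
print(ensemble_score(subs, {"x": 1.0}, "a"))  # sample "a": only the second member votes
```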

The 137 NSCLC and 91 NHC samples were split into 5 parts. One part was used as a hold-out set and 4 parts were used as a dataset that was analyzed using SVM-RFE with 10-fold, 10-resample cross-validation. The final best N-gene classifier was then applied to the hold-out part. Cross-validation and hold-out accuracies were compared. In addition, 10 permutation datasets were generated: the labels of the 137 NSCLC and 91 NHC samples were shuffled randomly and the data were analyzed using SVM-RFE with 10-fold, 10-resample cross-validation. The final best-accuracy N-gene classifier was selected for each permutation and the accuracy was recorded. The average permutation accuracy across the 10 runs was calculated.
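The two validation set-ups described above, a one-in-five hold-out and label permutation, can be sketched as follows. The function names and seeds are hypothetical, and the hold-out split here is a plain random split; whether the original split was balanced by class or subclass is not modelled in this sketch.

```python
import random


def hold_out_split(sample_ids, n_parts=5, seed=0):
    """Set aside one of `n_parts` roughly equal random parts as a hold-out set."""
    rng = random.Random(seed)
    shuffled = sample_ids[:]
    rng.shuffle(shuffled)
    hold_out = shuffled[::n_parts]  # every n-th shuffled sample
    held = set(hold_out)
    development = [s for s in shuffled if s not in held]
    return development, hold_out


def permuted_label_sets(labels, n_permutations=10, seed=0):
    """Return shuffled copies of the labels; class counts are preserved."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        permutations.append(shuffled)
    return permutations


labels = ["NSCLC"] * 137 + ["NHC"] * 91
development, hold_out = hold_out_split(list(range(len(labels))))
print(len(development), len(hold_out))  # prints: 182 46
```

Each permuted label set would then be run through the same cross-validated SVM-RFE procedure as the real labels, so that the resulting accuracy estimates chance-level performance.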

Average cross-validation performance of SVM-RFE (figure not shown) indicated that, on average, 8 genes were required for best accuracy at each step during the 100 cross-validation steps. The 100 steps resulted in the 136 distinct genes reported in Table V above. The 136 genes that provided the best accuracy were further reduced to filter out as many genes as possible without losing accuracy. A polynomial of degree 5 was fit to the accuracy to detect the number of genes at which the accuracy starts to decline (i.e., at 29 genes). The genes in Table V are ranked in order of their contribution to the final classification score (the most important gene ranking first, etc.). Alternative names and symbols are referenced, and the symbol “NaN” indicates that a symbol for the gene is not yet available.

Classification scores were assigned by the 29 gene classifier to 137 NSCLC patients and 91 patients with non-malignant lung disease. A positive score indicated classification as cancer, a negative score as non-malignant disease. Table XI lists the patient ID number; the class of disease (AC: adenocarcinoma; LSCC: lung squamous cell carcinoma; NSCLC: not further characterized; Non-Healthy control (NHC) samples: patients with non-malignant lung disease, namely COPD: chronic obstructive pulmonary disease only; Benign Nodules: determined by biopsy; Other: various types of lung diseases without a defined COPD diagnosis); the classification score of each patient; the standard error of the mean; the diagnosis; and the stage of cancer, if any.

TABLE XI Individual patient SVM scores from 29-gene NSCLC classifier

ID          Class  Score  Error  Dx     Stage
NSCLC.1519  NSCLC   1.77  0.21   AC     3A
NSCLC.1138  NSCLC   1.65  0.07   LSCC   3B
NSCLC.1471  NSCLC   1.64  0.32   NSCLC  3A
NSCLC.1282  NSCLC   1.54  0.26   AC     3B
NSCLC.1154  NSCLC   1.54  0.23   AC     3A
NSCLC.1222  NSCLC   1.51  0.24   AC     1B
NSCLC.1175  NSCLC   1.48  0.21   AC     1A
NSCLC.1352  NSCLC   1.45  0.31   AC     1B
NSCLC.1600  NSCLC   1.40  0.29   NSCLC  3B
NSCLC.1647  NSCLC   1.39  0.23   LSCC   3B
NSCLC.1280  NSCLC   1.38  0.30   LSCC   3B
NSCLC.1311  NSCLC   1.36  0.15   AC     1A
NSCLC.1200  NSCLC   1.35  0.26   AC     3A
NSCLC.1602  NSCLC   1.35  0.22   LSCC   1A
NSCLC.1192  NSCLC   1.34  0.19   LSCC   1B
NSCLC.1177  NSCLC   1.32  0.11   AC     1B
NSCLC.1583  NSCLC   1.32  0.22   LSCC   3A
NSCLC.1397  NSCLC   1.32  0.34   AC     1A
NSCLC.1362  NSCLC   1.30  0.11   AC     3B
NSCLC.1403  NSCLC   1.30  0.18   AC     3B
NSCLC.1307  NSCLC   1.29  0.30   AC     1A
NSCLC.1559  NSCLC   1.27  0.14   AC     3A
NSCLC.1589  NSCLC   1.26  0.19   AC     2B
NSCLC.1155  NSCLC   1.25  0.17   AC     3A
NSCLC.1211  NSCLC   1.23  0.23   AC     1A
NSCLC.1631  NSCLC   1.23  0.18   AC     2B
NSCLC.1475  NSCLC   1.21  0.17   LSCC   1A
NSCLC.1437  NSCLC   1.20  0.28   LSCC   3A
NSCLC.1484  NSCLC   1.15  0.17   LSCC   3A
NSCLC.1166  NSCLC   1.15  0.35   AC     1B
NSCLC.1674  NSCLC   1.14  0.09   AC     3A
NSCLC.1454  NSCLC   1.13  0.19   LSCC   2B
NSCLC.1316  NSCLC   1.12  0.28   AC     1B
NSCLC.1569  NSCLC   1.11  0.21   NSCLC  3A
NSCLC.1339  NSCLC   1.07  0.27   LSCC   2B
NSCLC.1264  NSCLC   1.06  0.29   LSCC   4
NSCLC.1325  NSCLC   1.05  0.12   NSCLC  3B
NSCLC.1632  NSCLC   1.05  0.15   AC     2A
NSCLC.1473  NSCLC   1.03  0.30   AC     1B
NSCLC.1402  NSCLC   1.02  0.24   AC     4
NSCLC.1557  NSCLC   1.01  0.23   NSCLC  1B
NSCLC.1183  NSCLC   0.98  0.25   AC     1A
NSCLC.1455  NSCLC   0.97  0.16   LSCC   1A
NSCLC.1194  NSCLC   0.97  0.17   AC     4
NSCLC.1193  NSCLC   0.96  0.20   AC     1B
NSCLC.1224  NSCLC   0.96  0.13   AC     2A
NSCLC.1573  NSCLC   0.94  0.14   AC     3B
NSCLC.1375  NSCLC   0.94  0.25   NSCLC  1A
NSCLC.1214  NSCLC   0.93  0.32   LSCC   1B
NSCLC.1630  NSCLC   0.92  0.22   NSCLC  3A
NSCLC.1343  NSCLC   0.92  0.20   AC     3A
NSCLC.1561  NSCLC   0.91  0.21   LSCC   2A
NSCLC.1435  NSCLC   0.89  0.25   AC     1A
NSCLC.1221  NSCLC   0.88  0.32   AC     3A
NSCLC.1449  NSCLC   0.87  0.14   LSCC   1A
NSCLC.1413  NSCLC   0.85  0.21   LSCC   1B
NSCLC.1287  NSCLC   0.84  0.20   AC     1B
NSCLC.1387  NSCLC   0.84  0.21   AC     3A
NSCLC.1140  NSCLC   0.83  0.21   AC     3B
NSCLC.1598  NSCLC   0.83  0.31   AC     1A
NSCLC.1415  NSCLC   0.78  0.20   AC     1A
NSCLC.1369  NSCLC   0.77  0.21   AC     1B
NSCLC.1591  NSCLC   0.75  0.10   AC     1A
NSCLC.1469  NSCLC   0.75  0.25   AC     1A
NSCLC.1141  NSCLC   0.75  0.23   AC     1B
NSCLC.1340  NSCLC   0.74  0.37   AC     1A
NSCLC.1178  NSCLC   0.73  0.13   LSCC   3B
NSCLC.1604  NSCLC   0.73  0.21   AC     2B
NSCLC.1429  NSCLC   0.70  0.15   LSCC   1A
NSCLC.1681  NSCLC   0.67  0.26   NSCLC  3B
NSCLC.1542  NSCLC   0.67  0.24   AC     1A
NSCLC.1572  NSCLC   0.66  0.26   AC     1A
NSCLC.1143  NSCLC   0.66  0.31   AC     1A
NSCLC.1439  NSCLC   0.66  0.35   AC     3B
NSCLC.1189  NSCLC   0.61  0.27   LSCC   3A
NSCLC.1312  NSCLC   0.61  0.27   AC     2B
NSCLC.1323  NSCLC   0.61  0.32   AC     4
NSCLC.1466  NSCLC   0.61  0.30   LSCC   2B
NSCLC.1643  NSCLC   0.59  0.21   AC     3B
NSCLC.1550  NSCLC   0.58  0.21   AC     2B
NSCLC.1423  NSCLC   0.55  0.26   LSCC   1B
NSCLC.1468  NSCLC   0.54  0.19   LSCC   1A
NSCLC.1167  NSCLC   0.54  0.31   AC     1A
NSCLC.1436  NSCLC   0.54  0.31   AC     1A
NSCLC.1368  NSCLC   0.53  0.16   AC     1A
NSCLC.1158  NSCLC   0.52  0.41   AC     1A
NSCLC.1137  NSCLC   0.51  0.26   AC     2B
NSCLC.1656  NSCLC   0.51  0.12   AC     3A
NSCLC.1592  NSCLC   0.50  0.20   LSCC   1B
NSCLC.1489  NSCLC   0.48  0.29   AC     2A
NSCLC.1566  NSCLC   0.47  0.21   LSCC   3B
NSCLC.1284  NSCLC   0.45  0.25   LSCC   1A
NSCLC.1204  NSCLC   0.43  0.31   LSCC   1A
NSCLC.1400  NSCLC   0.43  0.33   LSCC   1A
NSCLC.1622  NSCLC   0.42  0.42   NSCLC  1A
NSCLC.1482  NSCLC   0.42  0.19   LSCC   1A
NSCLC.1390  NSCLC   0.41  0.11   LSCC   2B
NSCLC.1597  NSCLC   0.39  0.11   AC     3A
NSCLC.1388  NSCLC   0.36  0.27   NSCLC  3B
NSCLC.1444  NSCLC   0.35  0.23   AC     3A
NSCLC.1463  NSCLC   0.35  0.22   LSCC   1A
NSCLC.1586  NSCLC   0.34  0.29   LSCC   1A
NSCLC.1233  NSCLC   0.30  0.28   LSCC   2A
NSCLC.1713  NSCLC   0.29  0.22   AC     3B
NSCLC.1344  NSCLC   0.29  0.28   AC     1B
NSCLC.1171  NSCLC   0.27  0.35   LSCC   1A
NSCLC.1590  NSCLC   0.25  0.18   AC     3A
NSCLC.1196  NSCLC   0.25  0.26   LSCC   2B
NSCLC.1451  NSCLC   0.24  0.22   AC     1B
NSCLC.1709  NSCLC   0.24  0.23   LSCC   3B
NSCLC.1560  NSCLC   0.23  0.30   AC     3A
NSCLC.1584  NSCLC   0.19  0.44   AC     1A
NSCLC.1269  NSCLC   0.18  0.23   LSCC   1A
NSCLC.1595  NSCLC   0.17  0.23   LSCC   1B
NSCLC.1286  NSCLC   0.16  0.25   AC     1A
NSCLC.1202  NSCLC   0.14  0.31   AC     1B
NSCLC.1292  NSCLC   0.13  0.22   LSCC   1B
NSCLC.1491  NSCLC   0.12  0.17   AC     1B
NSCLC.1373  NSCLC   0.09  0.23   AC     1B
NSCLC.1303  NSCLC   0.09  0.20   LSCC   1A
NSCLC.1614  NSCLC   0.08  0.28   LSCC   1B
NSCLC.1337  NSCLC   0.05  0.31   AC     1A
NSCLC.1453  NSCLC   0.02  0.15   AC     4
NSCLC.1227  NSCLC   0.01  0.32   AC     1A
NSCLC.1216  NSCLC  −0.01  0.38   AC     1A
NSCLC.1254  NSCLC  −0.09  0.30   LSCC   1A
NSCLC.1136  NSCLC  −0.13  0.32   AC     1A
NSCLC.1346  NSCLC  −0.15  0.21   AC     2A
NSCLC.1445  NSCLC  −0.32  0.35   AC     2A
NSCLC.1431  NSCLC  −0.34  0.29   AC     1A
NSCLC.1582  NSCLC  −0.38  0.17   AC     1B
NSCLC.1427  NSCLC  −0.43  0.24   AC     1A
NSCLC.1430  NSCLC  −0.45  0.23   AC     1A
NSCLC.1153  NSCLC  −0.51  0.27   AC     1A
NSCLC.1262  NSCLC  −0.51  0.29   AC     1A
NSCLC.1548  NSCLC  −0.61  0.31   AC     1B
NSCLC.1386  NSCLC  −0.65  0.22   AC     1B
NHC.1218    NHC     1.13  0.36   GI     0
NHC.1588    NHC     0.96  0.31   GI     0
NHC.1146    NHC     0.80  0.23   HAM    0
NHC.10062   NHC     0.77  0.33   COPD   0
NHC.1554    NHC     0.72  0.20   NM     0
NHC.10027   NHC     0.60  0.30   COPD   0
NHC.1474    NHC     0.59  0.19   NM     0
NHC.1628    NHC     0.51  0.37   GI     0
NHC.10010   NHC     0.48  0.29   HTN    0
NHC.1263    NHC     0.48  0.21   NM     0
NHC.1619    NHC     0.45  0.10   GI     0
NHC.1361    NHC     0.42  0.27   NM     0
NHC.1575    NHC     0.38  0.19   GI     0
NHC.1522    NHC     0.21  0.12   GI     0
NHC.1562    NHC     0.11  0.27   NM     0
NHC.10047   NHC     0.11  0.31   COPD   0
NHC.1424    NHC     0.04  0.21   GI     0
NHC.10037   NHC     0.02  0.32   COPD   0
NHC.10063   NHC    −0.01  0.22   COPD   0
NHC.1677    NHC    −0.05  0.15   GI     0
NHC.10044   NHC    −0.16  0.23   SARC   0
NHC.1260    NHC    −0.16  0.25   NM     0
NHC.1182    NHC    −0.23  0.38   PN     0
NHC.10043   NHC    −0.25  0.31   COPD   0
NHC.10064   NHC    −0.29  0.29   COPD   0
NHC.1148    NHC    −0.30  0.35   GI     0
NHC.1184    NHC    −0.30  0.26   NM     0
NHC.1618    NHC    −0.33  0.20   GI     0
NHC.10046   NHC    −0.33  0.15   COPD   0
NHC.1657    NHC    −0.37  0.25   SARC   0
NHC.10034   NHC    −0.44  0.24   COPD   0
NHC.10036   NHC    −0.45  0.21   COPD   0
NHC.10058   NHC    −0.47  0.23   COPD   0
NHC.10054   NHC    −0.49  0.20   COPD   0
NHC.10028   NHC    −0.50  0.14   COPD   0
NHC.10004   NHC    −0.52  0.32   PS     0
NHC.10040   NHC    −0.53  0.20   COPD   0
NHC.1442    NHC    −0.56  0.32   NM     0
NHC.1438    NHC    −0.61  0.25   NM     0
NHC.10038   NHC    −0.63  0.20   COPD   0
NHC.1488    NHC    −0.64  0.16   GI     0
NHC.10042   NHC    −0.65  0.22   COPD   0
NHC.1594    NHC    −0.66  0.17   GI     0
NHC.1186    NHC    −0.66  0.36   NM     0
NHC.1399    NHC    −0.66  0.29   GI     0
NHC.1191    NHC    −0.68  0.27   NM     0
NHC.10048   NHC    −0.69  0.30   COPD   0
NHC.10061   NHC    −0.69  0.35   COPD   0
NHC.10049   NHC    −0.70  0.28   COPD   0
NHC.10055   NHC    −0.70  0.25   COPD   0
NHC.10023   NHC    −0.74  0.17   CR     0
NHC.1242    NHC    −0.74  0.27   NM     0
NHC.10003   NHC    −0.77  0.34   HTN    0
NHC.10039   NHC    −0.80  0.22   COPD   0
NHC.1697    NHC    −0.84  0.14   GI     0
NHC.1309    NHC    −0.86  0.25   NM     0
NHC.1305    NHC    −0.92  0.19   GI     0
NHC.1185    NHC    −0.93  0.21   NM     0
NHC.1289    NHC    −0.94  0.28   NM     0
NHC.1277    NHC    −0.94  0.27   NM     0
NHC.10029   NHC    −0.95  0.21   COPD   0
NHC.10053   NHC    −0.97  0.18   COPD   0
NHC.1616    NHC    −1.00  0.11   NM     0
NHC.10030   NHC    −1.03  0.25   SARC   0
NHC.10019   NHC    −1.07  0.10   NHC    0
NHC.10035   NHC    −1.07  0.14   COPD   0
NHC.10051   NHC    −1.08  0.19   COPD   0
NHC.10013   NHC    −1.08  0.28   COPD   0
NHC.1251    NHC    −1.09  0.19   GI     0
NHC.10008   NHC    −1.11  0.28   GI     0
NHC.10018   NHC    −1.13  0.15   COPD   0
NHC.10012   NHC    −1.21  0.21   COPD   0
NHC.1342    NHC    −1.22  0.21   GI     0
NHC.10052   NHC    −1.25  0.25   COPD   0
NHC.10041   NHC    −1.27  0.18   COPD   0
NHC.10031   NHC    −1.32  0.27   COPD   0
NHC.1490    NHC    −1.34  0.15   NM     0
NHC.1250    NHC    −1.37  0.26   NM     0
NHC.10005   NHC    −1.40  0.13   CR     0
NHC.1267    NHC    −1.43  0.12   NM     0
NHC.10057   NHC    −1.52  0.27   COPD   0
NHC.1450    NHC    −1.56  0.34   GI     0
NHC.10001   NHC    −1.56  0.16   HTN    0
NHC.10022   NHC    −1.57  0.20   COPD   0
NHC.10059   NHC    −1.65  0.15   COPD   0
NHC.1328    NHC    −1.65  0.14   NM     0
NHC.1314    NHC    −1.68  0.20   GI     0
NHC.10050   NHC    −1.82  0.19   COPD   0
NHC.10033   NHC    −1.83  0.20   COPD   0
NHC.10032   NHC    −1.89  0.15   COPD   0
NHC.10056   NHC    −2.45  0.10   COPD   0

Example 15 Independent Validation Studies on Hold-Out Samples

To address issues of data over-fitting and to test the generality of the classification model before applying it to new samples, the analysis was re-performed, setting aside 20% of the patient and control samples, including representatives of each of the subclasses, for validation and training on the remaining 80%. Five separate and non-overlapping hold-out sets were subjected to this revalidation. The average accuracy over the 5 validation sets was 81%, as compared to an average accuracy of 82% for the 5 training sets (data not shown). The similar accuracy of the training and validation sets demonstrated the ability of the algorithm to classify new samples with the predicted accuracy. The slightly lower accuracy with the hold-out sets compared to cross-validation using all of the data (81% vs. 86%) was a reflection of the smaller number of samples available for training. By contrast, the average accuracy of the analysis with permuted sample labels was only 58% across 10 permutation runs. It was concluded that the 29 gene signature of Table V can distinguish patients with either of the two main NSCLC subtypes, and any of the four NSCLC tumor stages, from patients with other smoking-related but non-malignant lung diseases.

Example 16 Classification Accuracy for Patient and Control Subclasses Using 29 Genes

The accuracy of the 29 gene classifier was examined for the different types of patients and controls in the data set. Table XII below lists the accuracies for the 29 genes in identifying the various patient and control classes as well as for increasing pathological tumor stages. The individual classification accuracies for AC or LSCC alone were 86% and 98%, respectively, as compared to 91% for the combined patients. There were half as many LSCC in the dataset, but they were classified with significantly higher accuracy.

Lines 7-12 of Table XII showed an incremental increase in classification accuracy from Stage 1A (83%) to stages 3 and 4 (100%), supporting that the PBMC cancer signature becomes more pronounced with progressive disease. If only the controls with confirmed COPD and no evidence of lung nodules were considered, they classified with an accuracy of 89%, while patients with confirmed benign nodules (regardless of COPD status) had a classification accuracy of 71%. Thus, classification accuracy was influenced by cancer stage,

TABLE XII Performance of 29 gene classifier on subclasses of patients and controls.

#   Subclass  Accuracy by Class  Number of Samples
1   NSCLC      91%               137
2   NHC        80%               91
3   AC         86%               85
4   LSCC       98%               42
5   Nodules    71%               41
6   COPD       89%               38
7   Stage 1A   83%               48
8   Stage 1B   89%               27
9   Stage 1    85%               75
10  Stage 2    89%               18
11  Stage 3   100%               39
12  Stage 4   100%               5

Although 29 genes were sufficient to distinguish patient and control classes, many more statistically significant genes were differentially expressed (see Table V). The molecular functions most highly represented included regulation of gene expression, cell death, and cell growth and differentiation. Genes associated with the generation of memory T-cells, T-cell accumulation and mobilization of NK cells were mostly up in cancer, while B-cell receptor signaling pathways were down. Genes associated with activation or chemotaxis of myeloid cells and glucocorticoid receptor signaling genes were overwhelmingly down in the cancer patients.

The clinical application of the PBMC gene expression signature is clear. Assuming a lung cancer prevalence of 5% for patients with a lung nodule between 0.5 and 3.0 cm, the 29-gene classifier (with a cut-off value of zero) is anticipated to achieve a positive predictive value (PPV) and a negative predictive value (NPV) of 0.19 and 0.99, respectively, as shown in Table XIII below. These values exceed those established by the EDRN Lung Cancer Biomarker Group that determines if a biomarker is to be considered useful for additional study. These are similar to values for the 80 gene expression panel from bronchial brushings recently described¹⁸. Importantly, even higher clinical utility could be achieved in many patients by taking advantage of the actual value of the predictive score rather than using a strict positive or negative score cut-off. In the large dataset shown in Table XI above, no subject with an SVM score less than −0.65 had lung cancer, and only 5 of the 91 non-cancer control patients that were classified as lung cancer had an SVM score greater than +0.65. Thus, the actual value of the SVM score is useful for determining which patients require an invasive intervention as opposed to a more conservative approach, such as serial CT imaging.

TABLE XIII Positive predictive value and negative predictive value for 29-gene NSCLC classifier.

Study               Sensitivity  Specificity  Prevalence  PPV    NPV
NSCLC vs. NHC       0.91         0.8          1%          0.044  0.999
29 gene classifier                            5%          0.193  0.994
Spira et al., 2007  0.8          0.84         1%          0.048  0.998
80 gene classifier                            5%          0.208  0.988
LCGB                0.8          0.7          1%          0.026  0.997
Proposed biomarker                            5%          0.123  0.985
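The predictive values in Table XIII follow from sensitivity, specificity and prevalence by Bayes' rule. The short sketch below reproduces the rows for the 29-gene classifier; the function name `predictive_values` is chosen for this illustration only.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# 29-gene classifier: sensitivity 0.91, specificity 0.80.
for prevalence in (0.01, 0.05):
    ppv, npv = predictive_values(0.91, 0.80, prevalence)
    print(f"{prevalence:.0%}: PPV={ppv:.3f} NPV={npv:.3f}")
# prints:
# 1%: PPV=0.044 NPV=0.999
# 5%: PPV=0.193 NPV=0.994
```

The same function applied to the other sensitivity and specificity pairs in the table (0.8 and 0.84; 0.8 and 0.7) gives the remaining rows.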

Example 17 Classification of Patient and Control Samples from an Independent Site

All of the samples used to develop and validate the 29 gene panel were collected at the Hospital of the University of Pennsylvania. To further validate the utility of the classifier we analyzed 27 samples collected at the NYU Lung Cancer Biomarker Center, an Early Detection Research Network (EDRN) Clinical and Epidemiologic Validation Center. The 27 samples included 12 Stage 1 NSCLC (5 of which were from never smokers), and 15 smoker and ex-smoker controls, including 6 controls diagnosed by serial CT scans as having non-malignant Ground Glass Opacities (GGO)²¹. No GGO samples were included in our original training set.

Despite the differences in collection sites, sample processing and the different control population, the 27 samples were classified with an overall accuracy of 74% (20 of 27), sensitivity of 67% (8 of 12) and specificity of 80% (12 of 15). The SVM classification is shown in detail in Table XIV below.

TABLE XIV SVM classification scores by NSCLC classifier for NYU validation samples

ID      Class  Score  Error  Dx
NYU.1   NSCLC   1.07  0.06   AC
NYU.2   NSCLC   1.01  0.07   AC
NYU.3   NSCLC   0.95  0.06   AC
NYU.4   NSCLC   0.81  0.07   AC
NYU.5   NSCLC   0.71  0.08   AC
NYU.6   NSCLC   0.48  0.06   AC
NYU.7   NSCLC   0.29  0.08   AC
NYU.8   NSCLC   0.18  0.09   AC
NYU.9   NSCLC  −0.25  0.09   AC
NYU.10  NSCLC  −0.29  0.10   AC
NYU.11  NSCLC  −0.37  0.10   AC
NYU.12  NSCLC  −0.94  0.08   AC
NYU.13  NHC     1.16  0.10   GGO
NYU.14  NHC     0.70  0.11   N
NYU.15  NHC     0.69  0.10   GGO
NYU.16  NHC    −0.12  0.08   N
NYU.17  NHC    −0.13  0.09   GGO
NYU.18  NHC    −0.26  0.08   N
NYU.19  NHC    −0.39  0.09   GGO
NYU.20  NHC    −0.39  0.08   N
NYU.21  NHC    −0.46  0.10   N
NYU.22  NHC    −0.52  0.10   N
NYU.23  NHC    −0.58  0.07   N
NYU.24  NHC    −0.73  0.09   N
NYU.25  NHC    −0.75  0.10   N
NYU.26  NHC    −0.84  0.09   GGO
NYU.27  NHC    −0.94  0.08   GGO

Dx abbreviations: AC = adenocarcinoma, N = normal, GGO = ground glass opacities
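The reported accuracy, sensitivity and specificity can be recomputed from the scores in Table XIV with the zero cut-off (a score of 0 or above is called cancer). The sketch below is illustrative; the function name `performance` is an assumption, while the scores are copied from the table.

```python
# SVM scores from Table XIV, grouped by true class.
nsclc_scores = [1.07, 1.01, 0.95, 0.81, 0.71, 0.48, 0.29, 0.18,
                -0.25, -0.29, -0.37, -0.94]
nhc_scores = [1.16, 0.70, 0.69, -0.12, -0.13, -0.26, -0.39, -0.39,
              -0.46, -0.52, -0.58, -0.73, -0.75, -0.84, -0.94]


def performance(positive_scores, negative_scores, threshold=0.0):
    """Return (accuracy, sensitivity, specificity) at the given score cut-off."""
    true_pos = sum(1 for s in positive_scores if s >= threshold)
    true_neg = sum(1 for s in negative_scores if s < threshold)
    total = len(positive_scores) + len(negative_scores)
    accuracy = (true_pos + true_neg) / total
    sensitivity = true_pos / len(positive_scores)
    specificity = true_neg / len(negative_scores)
    return accuracy, sensitivity, specificity


acc, sens, spec = performance(nsclc_scores, nhc_scores)
print(f"accuracy={acc:.0%} sensitivity={sens:.0%} specificity={spec:.0%}")
# prints: accuracy=74% sensitivity=67% specificity=80%
```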

Two of the misclassified patients were never smokers and 2 of the misclassified controls were GGOs. The reduced accuracy in the external validation set was most likely due to differences in the processing of the samples (data not shown).

Example 18 29 Gene Classification of Independent Samples Before and After Tumor Removal

The 29 gene classifier was tested on an independent set of 36 samples from 18 NSCLC patients that included both pre- and post-resection samples. First, as further validation, when using this classifier, 14 of the 18 pre-surgery samples were correctly classified as cancer, for a sensitivity of 78%. Second, 13 of the 14 (92%) showed significant decreases in the SVM classification score after surgical resection. Seven of the post-resection samples had SVM scores that were negative and were classified as non-cancer samples in this analysis (data not shown). There was no obvious correlation between the change in the SVM scores and the time of post-resection PBMC collection, although the data set is relatively small.

Gene expression profiles change in PBMC after tumor removal, as demonstrated below. The analysis shown in FIG. 5 of the pre/post paired samples was carried out to determine whether the 29 gene classifier developed on patients with malignant vs. non-malignant disease would detect a difference in gene expression after the removal of the tumor. Given the observation that this was true for the majority of the samples, the extent of the differences between the sample classes was examined. The sample pairs were directly compared to further assess changes in gene expression that might result from removing the tumor. A significant effect on PBMC gene expression was found; 2060 genes were found to be differentially expressed across the pairs (paired two-tailed t-test, p<0.05 with a false discovery rate of 28%).

A separate SVM classifier for the pre- and post-surgery patients was generated, and the 50 genes forming that classifier set are reported in Table VI above. One classifier selected from the genes of Table VI was able to perfectly separate the two classes with as few as four genes. The top ranking four genes in this classifier include CYP2R1 (a microsomal vitamin D hydroxylase), MYO5B (mitochondrial 3-oxoacyl-Coenzyme A thiolase) and DGUOK (mitochondrial deoxyguanosine kinase), all down-regulated post-surgery, and DNCL1 (dynein, cytoplasmic, light chain 1), which is up-regulated after surgery. Two of the 4 genes (CYP2R1 and DGUOK) were also validated by quantitative real-time PCR on 10 sample pairs. The results are indicated in FIG. 6 and Table XV below.

TABLE XV PRE/POST PBMC surgery expression ratios for 10 patients as determined by Illumina gene expression arrays and QPCR analysis.

         CYP2R1                  DGUOK
Patient  Illumina arrays  PCR    Illumina arrays  PCR
4        1.13             1.33   1.33             1.21
5        1.55             1.28   1.12             1.01
6        1.49             1.73   1.21             1.41
7        1.33             1.06   1.12             1.17
11       1.44             1.58   1.38             1.09
14       1.37             1.29   1.30             1.25
15       1.15             2.65   1.14             2.14
16       1.42             0.96   1.19             0.76
17       1.60             1.57   1.21             1.56
18       1.10             1.09   1.31             1.15
AVERAGE  1.36             1.45   1.23             1.27

Example 19 Gene Expression Signature for Differentiation of Patients with Benign Lung Nodules

Since the patients with a diagnosis of a benign nodule are the most important control class for differentiation, a separate classifier was developed using only the controls with benign nodules, and its accuracy was assessed. Using the 41 controls with nodules and a randomly selected group of 54 NSCLC samples, SVM-RFE with cross-validation was applied, as described above. The resulting classifier (Table VII, genes 1-24) was 79% accurate, with a specificity of 80% for the nodules, and required as few as 24 genes, 7 of which were included in the 29 gene panel. Table VII lists the rank of the gene “RANK” in the NSCLC vs. NHC classifier, the Illumina Spot ID “ID”, the Accession No. “Acc. No.”, the description of the gene, its symbol, the NSCLC vs. GI.NM p-value “p-value”, and the NSCLC/GI.NM fold change “Fold Chg”.

H. REFERENCES

-   1. Yousef, M., et al., 2007 BMC Bioinformatics, 8: p. 144.
-   2. Jemal, A., et al., 2006 J Clin 56(2): p. 106-30.
-   3. Marcus, P. M., et al., 2000 J Natl Cancer Inst, 92(16): p. 1308-16.
-   4. Palmisano, W. A., et al., 2000 Cancer Res, 60(21): p. 5954-8.
-   5. Patz, E. F., Jr., et al., 2000 N Engl J Med, 343(22): p. 1627-33.
-   6. Hirsch, F. R., et al., 2001 Clin Cancer Res, 7(1): p. 5-22.
-   7. Burczynski M E, et al., 2005 Clin Cancer Res., 11(1181-9).
-   8. Burczynski, M. E., et al., 2005 Curr Mol Med, 5(1): p. 83-102.
-   9. Chang, H. Y., et al., 2002 Proc Natl Acad Sci USA, 99(20): p. 12877-82.
-   10. Borczuk, A. C., et al., 2003 Am J Pathol, 163(5): p. 1949-60.
-   11. Gao, C., et al., 2005 Nitric Oxide, 12(2): p. 121-6.
-   12. Mulshine, J. L., 2005 Oncology (Williston Park), 19(13): p. 1724-30; disc. 30-1.
-   13. Haiman, C. A., et al., 2006 N Engl J Med, 354(4): p. 333-42.
-   14. Diederich, S. and D. Wormanns, 2004 Lung Cancer 45 Suppl 2: p. S13-9.
-   15. Jett, J. R., 2005 Clin Cancer Res, 11(13 Pt 2): p. 4988s-4992s.
-   16. Deppermann, K. M., 2004 Lung Cancer, 45 Suppl 2: p. S39-42.
-   17. MacMahon, H., et al., 2005 Radiology, 237(2): p. 395-400.
-   18. Berger, M., et al., 2003 AJR Am J Roentgenol, 181(2): p. 359-65.
-   19. Mulshine, J. L., 2005 Clin Cancer Res, 11(13 Pt 2): p. 4993s-4998s.
-   20. Bhattacharjee, A., et al., 2001 Proc. Natl. Acad. Sci, USA, 98:13790-13795
-   21. Burczynski, M. E. and A. J. Dorner, 2006 Pharmacogenomics, 7(2): p. 187-202.
-   22. Chaussabel, D., et al., 2005 Ann N Y Acad Sci, 1062: p. 146-54.
-   23. Burczynski M E, et al., 2005 J. Mol Diagn., 8(51-61).
-   24. Deng M C, et al., 2006 Am J Transplant., 6: p. 150-160.
-   25. Achiron, A., et al., 2005 Breast Cancer Res Treat, 89(3): p. 265-70.
-   26. Achiron, A. and M. Gurevich, 2006 Autoimmun Rev, 5(8): p. 517-22.
-   27. Goronzy, J. J., et al., 2004 Arthritis Rheum, 50(1): p. 43-54.
-   28. Bull T M, et al., 2006 Am J Respir Crit Care Med., 4(170): p. 911-919.
-   29. Achiron, A., et al., 2007 Ann N Y Acad Sci, 1107: p. 155-67.
-   30. Sharp, F. R., et al., 2006 Arch Neurol, 63(11): p. 1529-1536.
-   31. Forrest, M. S., et al., 2005 Environ Health Perspect, 113(6): p. 801-7.
-   32. Theodoro, T. R., et al., 2007 Neoplasia, 9(6): p. 504-10.
-   33. Karimi, K., et al., 2006 Respir Res, 7: p. 66.
-   34. van Leeuwen, D. M., et al., 2007 Carcinogenesis, 28(3): p. 691-7.
-   35. Oudijk, E. J., et al., 2005 Thorax, 60(7): p. 538-44.
-   36. Lampe, J. W., et al., 2004 Cancer Epidemiol Biomarkers Prev, 13(3): p. 445-53.
-   37. Spira, A., et al., 2004 Proc Natl Acad Sci USA, 101(27): p. 10143-8.
-   38. Russo, A. L., et al., 2005 Clin Cancer Res, 11(7): p. 2466-70.
-   39. Kari, L., et al., 2003 J Exp Med, 197(11): p. 1477-88.
-   40. Talmadge, J. E., et al., 1996 Bone Marrow Transplant, 17(1): p. 101-9.
-   41. Redente, E. F., et al., 2007 Am J Pathol, 170(2): p. 693-708.
-   42. Twine, N., et al., 2003 Cancer Res., 6: p. 6069-75.
-   43. Sharma, P., et al., 2005 Breast Cancer Res, 7: p. 634-44.
-   44. DePrimo, S. E., et al., 2003 BMC Cancer, 3: p. http://www.biomedcentral.com/1471-2407/3/3.
-   45. Eady, J. J., et al., 2005 Physiol Genomics, 22(3): p. 402-11.
-   46. Whitney, A. R., et al., 2003 Proc Natl Acad Sci USA, 100(4): p. 1896-901.
-   47. Loboda, A., et al., 2003 Proc. Eur. Conf. on Computational Biology, GE-19: p. 383-84.
-   48. Guyon, I., et al., 2002 Machine Learning, 46(1-3): p. 389-422.
-   49. Critchley-Thorne, R. J., et al., 2007 PLoS Med, 4(5): p. e176.
-   50. Vachani, A., et al., 2007 Clin. Canc. Res., 13(10): p. 2905-2915.
-   51. Spira, A., et al., 2007 Nat Med, 13(3): p. 361-6.
-   52. Mukherjee, S., et al., 2003 J Comput Biol, 10(2): p. 119-42.
-   53. Wang, J., et al., 2007 Bioinformatics, 23(15): p. 2024-7.
-   54. Vapnik, V., 1999, The Nature of Statistical Learning Theory. Springer-Verlag. ISBN 0-387-98780-0.
-   55. Nebozhyn, M., et al., 2006 Blood, 107(8): p. 3189-96.
-   56. Marron, J. and M. Todd (2003) Distance Weighted Discrimination, School of Operations Research and Industrial Engineering, Cornell University
-   57. Virok, D., et al., 2003 J Infect Dis, 188(9): p. 1310-21.
-   58. Pepe, M. S., et al., 2003 Biometrics, 59(1): p. 133-42.
-   59. DeLong, E. R., et al., 1988 Biometrics, 44(3): p. 837-45.
-   60. Harrell, F. E., Jr., et al., WHO/ARI Young Infant Multicentre Study Group. Stat Med, 1998. 17(8): p. 909-44.
-   61. Benito, M., et al., 2004 Bioinformatics, 20(1):105-114
-   62. Chung, G T., et al., 1995 Oncogene, 11:2591-2598
-   63. Hirano, T., et al., 1994 Am J. Pathol., 144:296-302
-   64. Kishimoto, Y., et al., J Natl Cancer Inst, 1995 87:1224-1229
-   65. Tibshirani, R., et al., Proc Natl Acad Sci USA, 2002 99:6567-6572
-   66. Tonon, G., et al., Proc Natl Acad Sci, 2005 102:9625-9630
-   67. MacQueen, J. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; 1967. Some methods for classification and analysis of multivariate observations; pp. 281-297.
-   68. Talbot, S G, et al. Cancer Res. 2005; 65:3063-3071.
-   69. Ausubel et al., Current Protocols in Molecular Biology, Wiley Interscience Publishers, (1995).
-   70. Sambrook et al., Molecular Cloning: A Laboratory Manual, New York: Cold Spring Harbor Press, 1989
-   71. B. Lewin. Genes IV Cell Press, Cambridge Mass. 1990
-   72. Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, N.Y. 1994)
-   73. March, Advanced Organic Chemistry Reactions, Mechanisms and Structure 4th ed., John Wiley & Sons (New York, N.Y. 1992)
-   74. Parker & Barnes, 1999 Methods in Molecular Biology 106:247-283
-   75. Hod, 1992 Biotechniques 13:852-854
-   76. Weis et al., 1992 Trends in Genetics 8:263-264
-   77. Ausubel et al., Current Protocols of Molecular Biology, John Wiley and Sons (1997)
-   78. Rupp and Locker, 1987 Lab Invest. 56:A67
-   79. De Andres et al., 1995 BioTechniques 18:42044
-   80. T. E. Godfrey et al. 2000 J. Molec. Diagnostics 2:84-91
-   81. K. Specht et al., 2001 Am. J. Pathol. 158:419-29
-   82. Ding and Cantor, 2003 Proc. Natl. Acad. Sci. USA 100:3059-3064
-   83. U.S. Pat. No. 7,081,340
-   84. International Patent Application Publication No WO 2004/105573, published Dec. 9, 2004
-   85. Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, N.J., pp 365-386
-   86. Dieffenbach, C. W. et al., “General Concepts for PCR Primer Design” in: PCR Primer, A Laboratory Manual, Cold Spring Harbor Laboratory Press, New York, 1995, pp. 133-155
-   87. Innis and Gelfand, “Optimization of PCRs” in: PCR Protocols, A Guide to Methods and Applications, CRC Press, London, 1994, pp. 5-11
-   88. Plasterer, T. N. 1997 Methods Mol. Biol. 70:520-527
-   89. Golub T R, et al., 1999 Science. 286:531-537
-   90. ACS. Cancer Facts and Figures 2007. Atlanta: American Cancer Society; 2008.
-   91. Amos C I, et al. 2008 Nat Genet, 40:616-22.
-   92. Kang J U, et al., 2008 Cancer Genet Cytogenet, 184:31-7.
-   93. Thorgeirsson T E, et al., 2008 Nature, 452:638-42.
-   94. Henschke C I, et al., 2006 N Engl J Med, 355:1763-71.
-   95. Bach P B, 2007 JAMA, 297:953-61.
-   96. Ikeda K, et al., 2007 Chest, 132:984-90.
-   97. Machida E O, et al. 2006 Cancer Res, 66:6210-8.
-   98. Patz E F, Jr. et al., 2007 J Clin Oncol, 25:5578-83.
-   99. Yanagisawa K, et al. 2003 Lancet, 362:433-9.
-   100. Brichory F M, et al. 2001 Proc Natl Acad Sci USA, 98:9824-9.
-   101. Pontes E R, et al. 2006 Prostate, 66:1463-73.
-   102. Belinsky S A, et al. 2006 Cancer Res, 66:3338-44.
-   103. Ohta Y, et al. 2006 Ann Thorac Surg, 81:1194-7.
-   104. Osman I, et al. 2006 Clin Cancer Res, 12:3374-80.
-   105. Subramanian J, Govindan R. 2007 J Clin Oncol, 25:561-70.
-   106. Sun S, et al., 2007 Nat Rev Cancer, 7:778-90.
-   107. Hung R J, et al. 2008 Nature, 452:633-7.
-   108. Mashima T, Tsuruo T. 2005 Drug Resist Updat, 8:339-43.
-   109. Ozoren N, El-Deiry W S. 2003 Semin Cancer Biol, 13:135-47.
-   110. Held et al., Genome Research 6:986-994 (1996).

Each and every patent, patent application, and publication, including the priority applications and publicly available gene sequences cited throughout the disclosure, including the priority applications U.S. patent application Ser. No. 13/914,902, U.S. patent application Ser. No. 12/745,991, International patent application No. PCT/US2008/013450, and U.S. provisional application No. 61/005,569, is expressly incorporated herein by reference in its entirety. While this invention has been disclosed with reference to specific embodiments, it is apparent that other embodiments and variations of this invention may be devised by others skilled in the art without departing from the true spirit and scope of the invention. The appended claims include such embodiments and equivalent variations.

What is claimed is:
 1. (canceled)
 2. A composition for diagnosing theexistence or evaluating the progression of a lung cancer in a mammaliansubject, said composition comprising three or more polynucleotides,oligonucleotides or ligands, wherein each polynucleotide,oligonucleotide or ligand hybridizes to a different gene, gene fragmentor gene transcript selected from: i. AKR1C3; ii. ATP5B; iii. C15orf39;iv. CLIC3; v. CST7; vi. CTDSP2; vii. DDIT4; viii. DGUOK; ix. EIF2B4; x.FRAT2; xi. GZMB; xii. HAVCR2; xiii. HLA-DMB; xiv. KCTD12; xv. LYN; xvi.MID1IP1; xvii. MS4A6A; xviii. MYADM; xix. NAGK; xx. RAB10; xxi. RBM14;xxii. RXRA; xxiii. S100A8; xxiv. SAMSN1; and xxv. SH2D3C.
 3. Thecomposition according to claim 2, wherein each polynucleotide,oligonucleotide or ligand hybridizes to a mRNA transcript.
 4. The composition according to claim 2, which is a reagent comprising a substrate upon which said polynucleotides, oligonucleotides or ligands are immobilized.
 5. The composition according to claim 2, comprising a microarray, a microfluidics card, a chip or a chamber.
 6. The composition according to claim 2, which is a kit containing said three or more polynucleotides or oligonucleotides or ligands.
 7. The composition according to claim 6, wherein said polynucleotides or oligonucleotides are each part of a primer-probe set, and said kit comprises both primer and probe, wherein each said primer-probe set amplifies a different gene, gene fragment or gene expression product.
 8. The composition according to claim 2, wherein said polynucleotides, oligonucleotides or ligands are attached to silica beads.
 9. The composition according to claim 2, wherein one or more polynucleotide or oligonucleotide or ligand is associated with a detectable label.
 10. The composition according to claim 2, wherein the lung cancer is a non-small cell lung cancer.
 11. The composition according to claim 2, wherein said selected genes comprise 4 to 25 genes of any of (i) to (xxv).
 12. The composition according to claim 2, wherein the selected genes comprise at least 10 genes of any of (i) to (xxv).
 13. The composition according to claim 2, wherein the selected genes comprise at least 15 genes of any of (i) to (xxv).
 14. The composition according to claim 2, wherein the selected genes comprise at least 20 genes of any of (i) to (xxv).
 15. The composition according to claim 2, wherein the selected genes comprise at least 24 genes of any of (i) to (xxv).
 16. The composition according to claim 2, wherein the selected genes comprise all 25 genes of any of (i) to (xxv).
 17. The composition according to claim 2, wherein the selected genes comprise DDIT4 and GZMB.
 18. A method for diagnosing the existence or evaluating a lung cancer in a mammalian subject comprising identifying changes in the expression of three or more genes from the whole blood of said subject as compared to the levels of the same genes in a reference or control, said genes selected from: i. AKR1C3; ii. ATP5B; iii. C15orf39; iv. CLIC3; v. CST7; vi. CTDSP2; vii. DDIT4; viii. DGUOK; ix. EIF2B4; x. FRAT2; xi. GZMB; xii. HAVCR2; xiii. HLA-DMB; xiv. KCTD12; xv. LYN; xvi. MID1IP1; xvii. MS4A6A; xviii. MYADM; xix. NAGK; xx. RAB10; xxi. RBM14; xxii. RXRA; xxiii. S100A8; xxiv. SAMSN1; and xxv. SH2D3C; and diagnosing one or more of a diagnosis of a lung cancer, a diagnosis of a stage of lung cancer, a diagnosis of a type or classification of a lung cancer, a diagnosis or detection of a recurrence of a lung cancer, a diagnosis or detection of a regression of a lung cancer, a prognosis of a lung cancer, or an evaluation of the response of a lung cancer to a surgical or non-surgical therapy.
 19. The method according to claim 18, wherein the levels of gene expression are measured by mRNA levels.
 20. The method according to claim 18, further comprising stabilizing the mRNA in the whole blood sample of the subject and measuring mRNA levels of the three or more genes prior to identifying changes in the expression.